Title: An Unsupervised Approach to Biography Production using Wikipedia


1
An Unsupervised Approach to Biography Production
using Wikipedia
  • Fadi Biadsy, Julia Hirschberg, Elena Filatova
  • Columbia University
  • ACL 2008

2
Motivation
  • Biographies identify important information about
    individuals
  • Manually producing biographies is labor-intensive;
    they are usually produced only for famous individuals
  • Can we produce biographies automatically, about a
    wider variety of people, using the web, for
    example?

3
Overview
  • A multi-document summarization (MDS) approach based
    on extractive techniques to produce biographies
  • An automatic approach to collecting training data
    from Wikipedia

4
System Overview
(Pipeline diagram) Input: documents with NE tags and
resolved coreferences → Sentence Selection → hypothesis
sentences → Biographical-sentence classifier →
biographical sentences → Redundancy Removal →
non-redundant biographical sentences → Sentence Ordering
→ Reference Rewriting and Trimming → Biography
5
System Overview
(Same pipeline diagram as slide 4, highlighting Sentence
Selection)
6
Sentence Selection
  • Given N documents, find the sentences that
    contain a reference to the target person
  • → hypothesis sentences (see the sketch below)
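
A minimal sketch of this step, assuming the input documents
are already sentence-split and coreference-resolved;
mentions_target is a hypothetical predicate standing in for
the ACE coreference output, not part of the authors' system:

```python
# Sketch of hypothesis-sentence selection (not the authors' code).
def select_hypothesis_sentences(documents, mentions_target):
    """documents: list of documents, each a list of sentences.
    mentions_target(sentence) -> bool: True if the sentence contains a
    reference (name, nominal, or pronoun) to the target person,
    as determined by coreference resolution."""
    hypotheses = []
    for doc in documents:
        for sentence in doc:
            if mentions_target(sentence):
                hypotheses.append(sentence)
    return hypotheses
```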

7
System Overview
(Same pipeline diagram, highlighting the Biographical-sentence
classifier)
8
Biographical-Sentence Classifier
  • Train a binary classifier to identify
    biographical sentences
  • Manually annotating a large corpus of
    biographical and non-biographical information
    (e.g., Zhou et al., 2004) is labor-intensive
  • Our approach: collect biographical and
    non-biographical corpora automatically

9
Training Data: Biographical Corpus from Wikipedia
  • Utilize Wikipedia biographies
  • Extract 16,906 biographies from the XML version
    of Wikipedia
  • Apply simple text-processing techniques to clean
    up the text

10
Constructing the Biographical Corpus
  • Identify the subject of each biography
  • Run NYU's ACE system to tag NEs and perform
    coreference resolution (Grishman et al., 2005)

11
Constructing the Biographical Corpus
  1. Replace each NE by its tag type and subtype

In September 1951, King began his doctoral
studies in theology at Boston University. →
In TIMEX , PER_Individual began his
doctoral studies in theology at
ORG_Educational .
12
Constructing the Biographical Corpus
  1. Replace each NE by its tag type and subtype
  2. Non-pronominal referring expressions that are
    coreferential with the target person are replaced
    by TARGET_PER

In September 1951, King began his doctoral
studies in theology at Boston University. →
In TIMEX , TARGET_PER began his
doctoral studies in theology at
ORG_Educational .
13
Constructing the Biographical Corpus
  1. Replace each NE by its tag type and subtype
  2. Non-pronominal referring expressions that are
    coreferential with the target person are replaced
    by TARGET_PER
  3. Every pronoun P that refers to the target person
    is replaced by TARGET_P, where P is the pronoun
    replaced

In September 1951, King began his doctoral
studies in theology at Boston University. →
In TIMEX , TARGET_PER began TARGET_HIS
doctoral studies in theology at
ORG_Educational .
14
Constructing the Biographical Corpus
  1. Replace each NE by its tag type and subtype
  2. Non-pronominal referring expressions that are
    coreferential with the target person are replaced
    by TARGET_PER
  3. Every pronoun P that refers to the target person
    is replaced by TARGET_P, where P is the pronoun
    replaced
  4. Sentences containing no reference to the target
    person are removed

In September 1951, King began his doctoral
studies in theology at Boston University. →
In TIMEX , TARGET_PER began TARGET_HIS
doctoral studies in theology at
ORG_Educational .
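
The four replacement steps can be sketched as follows,
assuming a hypothetical annotation format (not the actual ACE
output) in which each sentence carries NE spans with character
offsets, a type_subtype tag, a coreference entity id, and the
surface form when the mention is a pronoun:

```python
# Illustrative sketch of steps 1-4; the data format is an assumption.
def transform(sentence, spans, target_ids):
    """spans: dicts with keys start, end, tag (e.g. 'ORG_Educational'),
    entity_id, and optionally pronoun (surface form of a pronominal mention).
    target_ids: coreference entity ids of the target person."""
    out, last, has_target = [], 0, False
    for s in sorted(spans, key=lambda x: x["start"]):
        out.append(sentence[last:s["start"]])
        if s["entity_id"] in target_ids:            # mention of the target person
            has_target = True
            if s.get("pronoun"):                    # step 3: TARGET_<pronoun>
                out.append("TARGET_" + s["pronoun"].upper())
            else:                                   # step 2: TARGET_PER
                out.append("TARGET_PER")
        else:                                       # step 1: replace NE by its tag
            out.append(s["tag"])
        last = s["end"]
    out.append(sentence[last:])
    return "".join(out) if has_target else None     # step 4: drop if no target
```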
15
Constructing the Non-Biographical Corpus
  • English newswire articles in TDT4 are used to
    represent non-biographical sentences
  • Run NYU's ACE system on each article
  • Select a PERSON NE mention at random from all NEs
    in the article to represent the target person
  • Exclude sentences with no reference to this
    target
  • Replace referring expressions and NEs as in the
    biography corpus

16
Biographical-Sentence Classifier
  • Train a classifier on the biographical and
    non-biographical corpora
  • Biographical corpus: 30,002 sentences from
    Wikipedia (2,108 sentences held out for testing)
  • Non-biographical corpus: 23,424 sentences from
    TDT4 (2,108 sentences held out for testing)

17
Biographical-Sentence Classifier
  • Features
  • Frequencies of class-based/lexical 1-, 2-, and
    3-grams, e.g.:
  • TARGET_PER was born
  • TARGET_HER husband was
  • TARGET_PER said
  • Frequencies of POS 1- and 2-grams
  • Chi-square for feature selection (a sketch of the
    feature pipeline follows below)
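
A hedged sketch of this feature pipeline using scikit-learn
(the paper does not name a toolkit, and the cutoff k is an
assumption); the POS n-grams would be added analogously, e.g.
via a FeatureUnion over a second vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # class-based/lexical 1-, 2-, and 3-grams, e.g. "TARGET_PER was born"
    ("ngrams", CountVectorizer(ngram_range=(1, 3))),
    ("chi2", SelectKBest(chi2, k=10000)),  # chi-square selection; k is assumed
    ("clf", MultinomialNB()),              # MNB, one of the three classifiers tried
])
# pipeline.fit(train_sentences, labels)    # labels: 1 = biographical, 0 = not
```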

18
Classification Results
  • Experimented with three types of classifiers
  • Note: classifiers provide a confidence score for
    each classified sample

Classifier               Accuracy (%)   F-Measure
SVM                      87.6           0.87
M. Naïve Bayes (MNB)     84.1           0.84
C4.5                     81.8           0.82
19
System Overview
(Same pipeline diagram, highlighting Redundancy Removal)
20
Redundancy Removal
  • Some of the sentences we select may contain the
    same information
  • How can we minimize redundancy among the selected
    sentences?
  • Cluster biographical sentences using a single-link
    nearest-neighbor clustering technique based on
    stem overlap (Blair-Goldensohn et al., 2004)
  • Select from each cluster the sentence that
    maximizes the classifier confidence score (see the
    sketch below)
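
A sketch of this step under stated assumptions: a crude
suffix-stripping stand-in for a stemmer and a normalized
overlap threshold, where Blair-Goldensohn et al. (2004)
define the actual overlap criterion:

```python
def stems(sentence):
    # crude stemmer stand-in; a real system would use e.g. Porter stemming
    return {w.lower().rstrip("s") for w in sentence.split()}

def cluster_single_link(sentences, threshold=0.5):
    """Single-link clustering: one sufficiently overlapping pair of
    sentences merges their clusters (union-find over all pairs)."""
    n, parent = len(sentences), list(range(len(sentences)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]; i = parent[i]
        return i
    sets = [stems(s) for s in sentences]
    for i in range(n):
        for j in range(i + 1, n):
            overlap = len(sets[i] & sets[j]) / max(1, len(sets[i] | sets[j]))
            if overlap >= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(sentences[i])
    return list(clusters.values())

def pick_representatives(clusters, confidence):
    # per cluster, keep the sentence the classifier was most confident about
    return [max(cluster, key=confidence) for cluster in clusters]
```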

21
System Overview
(Same pipeline diagram, highlighting Sentence Ordering)
22
Sentence Ordering: Two Methods
  • Method I: use the classifier confidence score to
    order the biographical sentences
  • Method II:
  • Learn from the presentation order of Wikipedia
    biographies
  • Represent sentence positions as integers
  • SVM regression with an RBF kernel, using
    class-based/lexical unigrams and bigrams as
    features (a sketch follows below)
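
A sketch of Method II with scikit-learn's SVR (RBF kernel);
the training targets are sentence positions (1, 2, 3, ...) in
the Wikipedia biographies, and the vectorizer choice is an
assumption:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

def train_order_model(bio_sentences, positions):
    # class-based/lexical unigram and bigram features, RBF-kernel regression
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                          SVR(kernel="rbf"))
    model.fit(bio_sentences, positions)
    return model

def order_sentences(model, sentences):
    # present the selected sentences in order of predicted biography position
    return sorted(sentences, key=lambda s: model.predict([s])[0])
```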

23
System Overview
(Same pipeline diagram, highlighting Reference Rewriting and
Trimming)
24
Reference Rewriting
  • News articles typically provide important
    biographical information about individuals when
    they are first mentioned
  • NYU's ACE system tags full noun phrases, including
    appositives, as part of NEs
  • Example:
  • <Mention type="PER" subtype="Individual">
  • Brian Jones, the co-pilot on the Breitling
    mission,
  • </Mention> ...

25
Reference Rewriting
  • Search for the sentence containing the longest NE
    that includes the target person's full name and is
    coreferential with the target person (LONG-NE)

26
Example: Brian Jones biography
  • The following sentence contains the LONG-NE:
  • Brian Jones, the co-pilot on the Breitling
    mission, would have remained a quiet but crucial
    part of his team's ballooning effort if it had
    not been for a series of lucky breaks.

27
Reference Rewriting
  • Search for the sentence containing the longest NE
    that includes the target person's full name and is
    coreferential with the target person (LONG-NE)
  • If this sentence was classified as biographical →
    boost its rank in the biography to first

28
Example: Brian Jones biography
  • The following sentence contains the LONG-NE:
  • Brian Jones, the co-pilot on the Breitling
    mission, would have remained a quiet but crucial
    part of his team's ballooning effort if it had
    not been for a series of lucky breaks.

This sentence was NOT classified as biographical
29
Reference Rewriting
  • Search for the sentence containing the longest NE
    that includes the target person's full name and is
    coreferential with the target person (LONG-NE)
  • If this sentence was classified as biographical →
    boost its rank in the summary to first
  • Otherwise, replace the reference to the target
    person in the first sentence of the biography with
    the LONG-NE (a sketch of this heuristic follows
    below)
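
A sketch of the heuristic; function and argument names are
assumptions, and splicing the LONG-NE over the shorter
surface reference is simplified here to a string replace:

```python
def rewrite_references(ordered_bio, long_ne_sentence, long_ne, short_ref):
    """ordered_bio: ranked biographical sentences; long_ne: e.g.
    'Brian Jones, the co-pilot on the Breitling mission,';
    short_ref: the shorter reference in the first sentence, e.g. 'Jones'."""
    if long_ne_sentence in ordered_bio:
        # the LONG-NE sentence was classified biographical: boost it to first
        rest = [s for s in ordered_bio if s != long_ne_sentence]
        return [long_ne_sentence] + rest
    # otherwise rewrite the first sentence's target reference with the LONG-NE
    first = ordered_bio[0].replace(short_ref, long_ne, 1)
    return [first] + ordered_bio[1:]
```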

30
Example: Brian Jones biography
  • The following sentence contains the LONG-NE:
  • Brian Jones, the co-pilot on the Breitling
    mission, would have remained a quiet but crucial
    part of his team's ballooning effort if it had
    not been for a series of lucky breaks.
  • First sentence:
  • Born in Bristol in 1947, Jones learned to fly at
    16, dropping out of school a year later to join
    the Royal Air Force.
  • Born in Bristol in 1947, Brian Jones, the
    co-pilot on the Breitling mission, learned to fly
    at 16, dropping out of school a year later to
    join the Royal Air Force.

Replace the target reference by the LONG-NE
31
Evaluation Task
  • Evaluate using DUC (Document Understanding
    Conferences) 2004, Task 5
  • Task: given a document cluster and a person's name
    X, answer the question
  • Who is X?
  • Summary should be no longer than 665 bytes

32
Evaluation Corpus
  • 50 clusters of TREC documents
  • each cluster contains 10 documents on average
  • We have the output summaries of the 22 systems
    that participated in the original competition
  • NIST had 4 human summaries written for each
    cluster → used ROUGE to evaluate
  • Our system is evaluated against top-DUC2004, the
    best-performing of the 22 systems (according to
    the ROUGE-L metric)

33
Comparing our best system to the DUC2004 systems
and humans using ROUGE-L score
34
Comparing our systems to top-DUC2004
  • Order I: classifier confidence score order
  • Order II: SVM regression order

35-38
Automatic Evaluation using ROUGE-L
  • Order I: classifier confidence score order
  • Order II: SVM regression order
(ROUGE-L results charts, built up across these slides)
39
Manual Evaluation (I): MNB vs. top-DUC2004
  • Subjects: 3 native speakers of American English
  • They were presented with pairs of summaries
  • top-DUC2004
  • MNB, Order I
  • Task: decide which summary is more responsive in
    form and content to the question, or whether both
    are equally responsive

40
Manual Evaluation (I): MNB vs. top-DUC2004
  • 85.3% (128/150) of judgments preferred one
    summary over the other
  • 78.1% of these (100/128) preferred the summaries
    produced by our system
  • Majority vote: in 42/50 summaries at least two
    subjects made the same choice; of these, 88.1%
    (37/42) preferred our summaries (p = 4.4e-7)
  • Kappa statistic: 0.441

41
Manual Evaluation (II): MNB Order I vs. MNB
Order II
  • Subjects: 3 (different) native speakers of
    American English
  • They were presented with pairs of summaries
  • MNB, Order I
  • MNB, Order II
  • Task: which summary has the better presentation
    order?
  • Kappa: 0.362
  • Majority vote: 61.7% (29/47) preferred summaries
    with Order II (not significant)

42
Conclusion
  • Described a system for producing biographies from
    training corpora collected automatically from
    Wikipedia and TDT4, with no manual annotation
    whatsoever
  • ACE markup and coreference information is embedded
    in the feature space used to train our models
  • Sentence ordering for biographies is modeled in
    two ways
  • A rewriting heuristic is used to create the final
    biography
  • Our system significantly outperforms all systems
    that participated in DUC 2004 according to
    ROUGE-L scores
  • The sentence order produced by SVM regression was
    preferred by our human subjects

43
Future Work
  • Sometimes only part of a sentence is
    biographical, so first simplify hypothesis
    sentences before classifying them
  • Cluster the sentences before learning the SVM
    regression model
  • Experiment with more linguistically informed
    features
  • Use our approach on other query-focused
    summarization tasks, e.g.:
  • What is X?
  • where X is an organization's name or a historical
    event
  • Wikipedia contains organizational profiles as
    well as lists of events and their descriptions

44
Thank You! Special thanks to Kathy McKeown
and the Speech and NLP groups at Columbia for
useful discussions.
Example of our system's output, Who is Brian
Jones?: Born in Bristol in 1947, Brian Jones, the
co-pilot on the Breitling mission, learned to fly
at 16, dropping out of school a year later to
join the Royal Air Force. After earning his
commercial balloon flying license, Jones became a
ballooning instructor in 1989 and was certified
as an examiner for balloon flight licenses by the
British Civil Aviation Authority. He helped
organize Breitling's most recent around-the-world
attempts, in 1997 and 1998. Jones, who is to turn
52 next week, is actually the team's third
co-pilot. After 13 years of service, he joined a
catering business and, in the 1980s,
46
TARGET_PER ( TIMEX TIMEX ) was one of
the main leaders of ORG_Commercial → 1
47
A Baptist minister, TARGET_PER became a civil
rights activist early in TARGET_HIS career → 2
48
SVM Regression: selecting important biographical
features
  • Each candidate feature t (a class-based/lexical
    unigram or bigram) is scored by its average
    relative frequency over the biography corpus:
  • score(t) = (1 / |D|) Σ_{d ∈ D} n_t(d) / |d|
  • where
  • D is the set of 17K Wikipedia biographies
  • n_t(d) is the number of occurrences of t in d
  • |d| is the length of d in tokens
  • Examples:

Unigrams: born, became, ORG_Educational
Bigrams: was born, TARGET_PER died, TARGET_PER
joined, TARGET_HIS family
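
A small sketch of this score, assuming the reading above
(average relative frequency of an n-gram t across the
biography corpus D):

```python
from collections import Counter

def avg_relative_frequency(t, corpus_tokens):
    """t: the n-gram as a list of tokens, e.g. ['was', 'born'];
    corpus_tokens: one token list per biography d in D."""
    total = 0.0
    for d in corpus_tokens:
        ngrams = Counter(zip(*[d[i:] for i in range(len(t))]))  # all len(t)-grams
        total += ngrams[tuple(t)] / max(1, len(d))              # n_t(d) / |d|
    return total / len(corpus_tokens)                           # average over D
```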
49
Example: Who is Brian Jones? (regression scores in
parentheses)
  • Born in Bristol in 1947, Jones learned to fly at
    16, dropping out of school a year later to join
    the Royal Air Force. (-1.6)
  • After earning his commercial balloon flying
    license, Jones became a ballooning instructor in
    1989 and was certified as an examiner for balloon
    flight licenses by the British Civil Aviation
    Authority. (4.6)
  • He helped organize Breitling's most recent
    around-the-world attempts, in 1997 and 1998.
    (6.9)
  • Jones, who is to turn 52 next week, is actually
    the team's third co-pilot. (8.2)
  • After 13 years of service, he joined a catering
    business and, in the 1980s, set about becoming a
    professional balloonist. (9.3)

50
Born on June 18, 1942, South African Deputy
President Thabo Mbeki grew up in poverty in
Mbewuleni village in the historic land of the
Xhosa. He was in the liberation struggle
virtually from the age of 10, when he and a
cousin sold Coke bottles to raise money to pay
their African National Congress membership fees.
Thabo Mbeki, who was elected president by the
National Assembly on Monday, was inaugurated in
an auspicious ceremony with a distinctly African
flavor at the Union Buildings in Pretoria. The
father, Govan Mbeki, spent much of his life in
jail and sent his children to relatives very
young so they couldn't come to depend on parents
who
51
Example: 3 sentences from a Wikipedia biography
  • Martin Luther King, Jr., was born on January 15,
    1929, in Atlanta, Georgia. He was the son of
    Reverend Martin Luther King, Sr. and Alberta
    Williams King. He had an older sister, Willie
    Christine (September 11, 1927) and a younger
    brother, Albert Daniel.

52
Example: class-based/lexical sentences
  • TARGET_PER , was born on TIMEX , in
    GPE_Population-Center .
  • TARGET_HE was the son of PER_Individual and
    PER_Individual .
  • TARGET_HE had an older sister ,
    PER_Individual ( TIMEX ) and a younger
    brother , PER_Individual .

53
Example: Who is Brian Jones?
  • After weeks of frustrating delays, Piccard and
    Jones set off from the Swiss Alps on March 1.
  • Balloonists Brian Jones and Bertrand Piccard tore
    their billowy craft to keep it from dragging the
    gondola across the sand.
  • "Bertrand and I were furiously trying to slash
    it with our knives, just to get the extra helium
    out," Jones said Wednesday after a ceremony to
    donate the torn and tattered balloon to the
    Smithsonian Institution's National Air and Space
    Museum.
  • The prospect that the Breitling's crew, Bertrand
    Piccard and Brian Jones, may succeed where all
    others have failed seemed more promising with
    each passing hour.
  • Jones, who is to turn 52 next week, is actually
    the team's third co-pilot.
  • After 13 years of service, he joined a catering
    business and, in the 1980s, set about becoming a
    professional balloonist.
  • Born in Bristol in 1947, Jones learned to fly at
    16, dropping out of school a year later to join
    the Royal Air Force.
  • Until he moved from his position as back-up pilot
    to pilot, he was an on-the-ground man,
    responsible as the team's project manager for the
    construction of both the balloon's gondola and
    its flight systems.
  • Brian Jones, the co-pilot on the Breitling
    mission, would have remained a quiet but crucial
    part of his team's ballooning effort if it had
    not been for a series of lucky breaks.
  • After earning his commercial balloon flying
    license, Jones became a ballooning instructor in
    1989 and was certified as an examiner for balloon
    flight licenses by the British Civil Aviation
    Authority.
  • He helped organize Breitling's most recent
    around-the-world attempts, in 1997 and 1998.

54
Example: Who is Brian Jones?
  • Jones, who is to turn 52 next week, is actually
    the team's third co-pilot.
  • After weeks of frustrating delays, Piccard and
    Jones set off from the Swiss Alps on March 1.
  • After 13 years of service, he joined a catering
    business and, in the 1980s, set about becoming a
    professional balloonist.
  • "Bertrand and I were furiously trying to slash
    it with our knives, just to get the extra helium
    out," Jones said Wednesday after a ceremony to
    donate the torn and tattered balloon to the
    Smithsonian Institution's National Air and Space
    Museum.
  • Born in Bristol in 1947, Jones learned to fly at
    16, dropping out of school a year later to join
    the Royal Air Force.
  • Brian Jones, the co-pilot on the Breitling
    mission, would have remained a quiet but crucial
    part of his team's ballooning effort if it had
    not been for a series of lucky breaks.
  • Until he moved from his position as back-up pilot
    to pilot, he was an on-the-ground man,
    responsible as the team's project manager for the
    construction of both the balloon's gondola and
    its flight systems.
  • Balloonists Brian Jones and Bertrand Piccard tore
    their billowy craft to keep it from dragging the
    gondola across the sand.
  • After earning his commercial balloon flying
    license, Jones became a ballooning instructor in
    1989 and was certified as an examiner for balloon
    flight licenses by the British Civil Aviation
    Authority.
  • The prospect that the Breitling's crew, Bertrand
    Piccard and Brian Jones, may succeed where all
    others have failed seemed more promising with
    each passing hour.
  • He helped organize Breitling's most recent
    around-the-world attempts, in 1997 and 1998.

55
Related Work
  • Blair-Goldensohn et al. (2004a)'s DefScriber
    treats Who is X? as a definition question and
    targets definitional themes (e.g., genus, species)
    found in the input document collections that
    include references to the target person.
    Biographical facts (such as birth and death) are
    therefore not considered.
  • Duboue et al. (2003) also address the problem of
    learning content-selection rules for biographies.
    These rules are learned from two corpora: a
    semi-structured corpus with lists of biographical
    facts about show-business celebrities, and a
    free-text corpus of biographies about the same
    celebrities.
  • Weischedel et al. (2004) focus on modeling
    kernel-fact features typical of biographies using
    linguistic and semantic processing. Linguistic
    features are derived from predicate-argument
    structures deduced from parse trees; semantic
    features are the biography-related relations and
    events defined in the ACE guidelines. Sentences
    containing these kernel facts are ranked by the
    probability of their relevance to biographies:
    a corpus of manually created biographies,
    including Wikipedia, is used to estimate the
    conditional distribution of relevant material
    given a kernel fact, and a background corpus is
    used to learn the feature distribution.
  • Barzilay and Lee (2004) used a novel adaptation
    of Hidden Markov Model algorithms to learn how to
    structure summaries in a specific domain by
    analyzing topic shifts within a corpus in that
    domain.