Title: An Unsupervised Approach to Biography Production using Wikipedia
1An Unsupervised Approach to Biography Production
using Wikipedia
- Fadi Biadsy, Julia Hirschberg, Elena Filatova
- Columbia University
- ACL 2008
2Motivation
- Biographies identify important information about individuals
- Manually producing biographies is labor-intensive, so they are usually produced only for famous individuals
- Can we produce biographies automatically, about a wider variety of people, using the web, for example?
3Overview
- Multi-document summarization (MDS) approach based on extractive techniques to produce biographies
- Automatic approach to collect training data from Wikipedia
4System Overview
Input: documents with NE tags and resolved coreferences
→ Sentence Selection → hypothesis sentences
→ Biographical sentence classifier → biographical sentences
→ Redundancy Removal → non-redundant biographical sentences
→ Sentence ordering
→ Reference Rewriting and Trimming
→ Biography
6Sentence Selection
- Given N documents, find the sentences that contain a reference to the target person
- → hypothesis sentences
8Biographical-Sentence Classifier
- Train a binary classifier to identify biographical sentences
- Manually annotating a large corpus of biographical and non-biographical information (e.g., Zhou et al., 2004) is labor-intensive
- Our approach collects biographical and non-biographical corpora automatically
9Training Data: Biographical Corpus from Wikipedia
- Utilize Wikipedia biographies
- Extract 16,906 biographies from the XML version of Wikipedia
- Apply simple text processing techniques to clean up the text
10Constructing the Biographical Corpus
- Identify the subject of each biography
- Run NYU's ACE system to tag NEs and do coreference resolution (Grishman et al., 2005)
11Constructing the Biographical Corpus
- Replace each NE by its tag type and subtype
Original: In September 1951, King began his doctoral studies in theology at Boston University.
Tagged: In TIMEX , PER_Individual began TARGET_HIS doctoral studies in theology at ORG_Educational .
14Constructing the Biographical Corpus
- Replace each NE by its tag type and subtype
- Non-pronominal referring expressions that are coreferential with the target person are replaced by TARGET_PER
- Every pronoun P that refers to the target person is replaced by TARGET_P, where P is the pronoun replaced
- Sentences containing no reference to the target person are removed
Original: In September 1951, King began his doctoral studies in theology at Boston University.
Tagged: In TIMEX , TARGET_PER began TARGET_HIS doctoral studies in theology at ORG_Educational .
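The replacement rules above can be sketched roughly as follows. This is only an illustration: the mention dictionaries (character offsets plus `tag`, `target`, and `pronoun` fields) and the helper names are hypothetical stand-ins, not the actual output format of the ACE system.

```python
def to_class_based(sentence, mentions):
    """Replace each tagged mention with a class-based token.

    `mentions` is a hypothetical list of dicts with character offsets;
    the real ACE output format is different.
    """
    pieces, cursor = [], 0
    for m in sorted(mentions, key=lambda m: m["start"]):
        surface = sentence[m["start"]:m["end"]]
        if m.get("target") and m.get("pronoun"):
            token = "TARGET_" + surface.upper()  # pronoun P -> TARGET_P
        elif m.get("target"):
            token = "TARGET_PER"                 # non-pronominal target mention
        else:
            token = m["tag"]                     # any other NE -> type_subtype
        pieces.append(sentence[cursor:m["start"]])
        pieces.append(token)
        cursor = m["end"]
    pieces.append(sentence[cursor:])
    return "".join(pieces)


def mention(sentence, surface, **attrs):
    """Helper for the example: locate the first occurrence of `surface`."""
    start = sentence.index(surface)
    return {"start": start, "end": start + len(surface), **attrs}


def keep_sentence(class_based):
    """Sentences with no reference to the target person are removed."""
    return "TARGET_" in class_based


sentence = ("In September 1951, King began his doctoral studies "
            "in theology at Boston University.")
mentions = [
    mention(sentence, "September 1951", tag="TIMEX"),
    mention(sentence, "King", tag="PER_Individual", target=True),
    mention(sentence, "his", tag="PER_Individual", target=True, pronoun=True),
    mention(sentence, "Boston University", tag="ORG_Educational"),
]
class_based = to_class_based(sentence, mentions)
```

Unlike the slide example, this sketch does not tokenize punctuation, so the output has no space before commas or periods.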
15Constructing the Non-Biographical Corpus
- English newswire articles in TDT4 used to represent non-biographical sentences
- Run NYU's ACE system on each article
- Select a PERSON NE mention at random from all NEs in the article to represent the target person
- Exclude sentences with no reference to this target
- Replace referring expressions and NEs as in the biographical corpus
16Biographical-Sentence Classifier
- Train a classifier on the biographical and non-biographical corpora
- Biographical corpus
  - 30,002 sentences from Wikipedia
  - 2,108 sentences held out for testing
- Non-biographical corpus
  - 23,424 sentences from TDT4
  - 2,108 sentences held out for testing
17Biographical-Sentence Classifier
- Features
  - Frequency of class-based/lexical unigrams, bigrams, and trigrams, e.g.
    - TARGET_PER was born
    - TARGET_HER husband was
    - TARGET_PER said
  - Frequency of POS unigrams and bigrams
- Chi-square for feature selection
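A minimal sketch of n-gram extraction and chi-square feature scoring is given below. The slides do not say how chi-square was computed; this version assumes the standard 2x2 contingency statistic over sentence-level presence of a feature, and the four toy sentences are invented for illustration only.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Count all 1..max_n grams in a token list."""
    return Counter(
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )


def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table.

    a: biographical sentences containing the feature
    b: non-biographical sentences containing the feature
    c: biographical sentences without it
    d: non-biographical sentences without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def score_features(bio, non_bio, max_n=3):
    """Score every n-gram by how strongly it separates the two corpora."""
    bio_sets = [set(ngrams(s.split(), max_n)) for s in bio]
    non_sets = [set(ngrams(s.split(), max_n)) for s in non_bio]
    vocab = set().union(*bio_sets, *non_sets)
    scores = {}
    for feat in vocab:
        a = sum(feat in s for s in bio_sets)
        b = sum(feat in s for s in non_sets)
        scores[feat] = chi_square(a, b, len(bio_sets) - a, len(non_sets) - b)
    return scores


# Toy corpora (invented for illustration only).
bio = ["TARGET_PER was born in GPE", "TARGET_PER was born TIMEX"]
non_bio = ["TARGET_PER said TIMEX", "TARGET_PER said the ORG"]
scores = score_features(bio, non_bio)
```

In this toy setup "was born" receives a high score because it occurs only in the biographical sentences, while "TARGET_PER" scores zero because it occurs in every sentence of both corpora.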
18Classification Results
- Experimented with three types of classifiers
- Note: Classifiers provide a confidence score for each classified sample

Classifier                       Accuracy (%)   F-Measure
SVM                              87.6           0.87
Multinomial Naïve Bayes (MNB)    84.1           0.84
C4.5                             81.8           0.82
20Redundancy Removal
- Some of the sentences we select may contain the same information
- How can we minimize redundancy among the selected sentences?
- Cluster biographical sentences using a single-link nearest-neighbor clustering technique based on stem overlap (Blair-Goldensohn et al., 2004)
- Select the sentence from each cluster that maximizes the classifier confidence score
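The clustering step can be sketched as below. Several details are assumptions made for illustration: the crude suffix stripper stands in for a real stemmer, Jaccard similarity stands in for the stem-overlap measure, and the 0.5 threshold is arbitrary. Single-link clustering is implemented by merging any two sentences whose overlap reaches the threshold.

```python
import re


def crude_stems(sentence):
    """Rough stand-in for a stemmer (assumption): strip a few suffixes."""
    stems = set()
    for word in re.findall(r"[a-z0-9]+", sentence.lower()):
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        stems.add(word)
    return stems


def overlap(a, b):
    """Jaccard overlap between two stem sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def remove_redundancy(scored, threshold=0.5):
    """scored: list of (sentence, classifier_confidence) pairs.

    Returns one sentence per single-link cluster: the one with the
    highest confidence, sorted by descending confidence.
    """
    stems = [crude_stems(sentence) for sentence, _ in scored]
    parent = list(range(len(scored)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Single link: one sufficiently similar pair is enough to merge clusters.
    for i in range(len(scored)):
        for j in range(i + 1, len(scored)):
            if overlap(stems[i], stems[j]) >= threshold:
                parent[find(i)] = find(j)

    best = {}
    for i, (sentence, conf) in enumerate(scored):
        root = find(i)
        if root not in best or conf > best[root][1]:
            best[root] = (sentence, conf)
    return sorted(best.values(), key=lambda pair: -pair[1])


# Confidence values here are invented for illustration.
kept = remove_redundancy([
    ("Jones learned to fly at 16.", 0.9),
    ("Jones learned to fly at age 16.", 0.7),
    ("He joined a catering business.", 0.8),
])
```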
22Sentence Ordering: Two Methods
- Method I: Use the classifier confidence score to order the biographical sentences
- Method II
  - Learn from the presentation order of Wikipedia biographies
  - Represent sentence positions as integers
  - SVM regression with RBF kernel, using class-based/lexical unigrams and bigrams as features
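Both orderings reduce to a sort once scores are available, as in the sketch below. The lookup table is a hypothetical stand-in for a trained regression model; its values are the position scores shown for three sentences in the Brian Jones example in this deck, and the sentence strings are abbreviated.

```python
def order_by_confidence(scored_sentences):
    """Method I: highest classifier confidence first."""
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked]


def order_by_predicted_position(sentences, predict_position):
    """Method II: smallest predicted position first.

    `predict_position` is any callable mapping a sentence to a number,
    e.g. a regression model trained on Wikipedia sentence positions.
    """
    return sorted(sentences, key=predict_position)


# Hypothetical stand-in for the regression model: a lookup table.
predicted = {
    "He helped organize ...": 6.9,
    "Born in Bristol in 1947, ...": -1.6,
    "After earning his commercial balloon flying license, ...": 4.6,
}
ordered = order_by_predicted_position(list(predicted), predicted.get)
```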
24Reference Rewriting
- News articles typically provide important biographical information about individuals when they are first mentioned
- NYU's ACE system tags full noun phrases, including appositives, as part of NEs
- Example
  - <Mention type=PER subtype=Individual>
  - Brian Jones, the co-pilot on the Breitling mission,
  - </Mention>...
27Reference Rewriting
- Search for the sentence containing the longest NE that includes the target person's full name and is coreferential with the target person (LONG-NE)
- If this sentence was classified as biographical → boost its rank in the biography to first
28Example: Brian Jones biography
- The following sentence contains LONG-NE
  - Brian Jones, the co-pilot on the Breitling mission, would have remained a quiet but crucial part of his team's ballooning effort if it had not been for a series of lucky breaks.
- This sentence was NOT classified as biographical
29Reference Rewriting
- Search for the sentence containing the longest NE that includes the target person's full name and is coreferential with the target person (LONG-NE)
- If this sentence was classified as biographical → boost its rank in the summary to first
- Otherwise, replace the reference to the target person in the first sentence of the biography by LONG-NE
30Example: Brian Jones biography
- The following sentence contains LONG-NE
  - Brian Jones, the co-pilot on the Breitling mission, would have remained a quiet but crucial part of his team's ballooning effort if it had not been for a series of lucky breaks.
- First sentence
  - Born in Bristol in 1947, Jones learned to fly at 16, dropping out of school a year later to join the Royal Air Force.
- Replace the target NE by LONG-NE
  - Born in Bristol in 1947, Brian Jones, the co-pilot on the Breitling mission, learned to fly at 16, dropping out of school a year later to join the Royal Air Force.
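The rewriting heuristic can be sketched as follows, assuming the LONG-NE string, its sentence, the classifier decision, and the target mention in the first sentence have already been identified upstream. The function and parameter names are hypothetical.

```python
def rewrite_reference(ordered, long_ne, long_ne_sentence,
                      long_ne_is_biographical, target_mention):
    """Apply the LONG-NE heuristic to an ordered list of sentences.

    If the LONG-NE sentence was classified as biographical, move it to
    the front. Otherwise, substitute LONG-NE for the first occurrence of
    the target mention in the first sentence.
    """
    if not ordered:
        return []
    if long_ne_is_biographical:
        rest = [s for s in ordered if s != long_ne_sentence]
        return [long_ne_sentence] + rest
    first = ordered[0].replace(target_mention, long_ne, 1)
    return [first] + ordered[1:]


LONG_NE = "Brian Jones, the co-pilot on the Breitling mission,"
LONG_NE_SENTENCE = (
    "Brian Jones, the co-pilot on the Breitling mission, would have "
    "remained a quiet but crucial part of his team's ballooning effort "
    "if it had not been for a series of lucky breaks."
)
biography = [
    "Born in Bristol in 1947, Jones learned to fly at 16, dropping out "
    "of school a year later to join the Royal Air Force.",
    "He helped organize Breitling's most recent around-the-world "
    "attempts, in 1997 and 1998.",
]
rewritten = rewrite_reference(
    biography, LONG_NE, LONG_NE_SENTENCE,
    long_ne_is_biographical=False, target_mention="Jones",
)
```

With `long_ne_is_biographical=False`, as in the Brian Jones example, the first sentence is rewritten in place and the remaining sentences keep their order.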
31Evaluation Task
- Evaluate using DUC (Document Understanding Conferences) 2004, Task 5
- Task: Given a document cluster and a person's name X
  - Answer the question "Who is X?"
- Summary should be no longer than 665 bytes
32Evaluation Corpus
- 50 clusters of TREC documents
  - Each cluster contains 10 documents on average
- We have the output summaries of the 22 systems that participated in the original competition
- NIST had 4 human summaries written for each cluster → used ROUGE to evaluate
- Our system was evaluated against top-DUC2004, the best performing of the 22 systems (according to the ROUGE-L metric)
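For readers unfamiliar with ROUGE-L, the sketch below shows its core idea: an F-measure built on the longest common subsequence (LCS) between a candidate and a reference. This is a simplified single-reference, whitespace-tokenized version written for illustration. The official ROUGE toolkit used in DUC handles multiple references and summary-level LCS and will not give identical numbers. The byte-limit helper illustrates the 665-byte cap and assumes UTF-8 encoding.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Simplified ROUGE-L F-measure for one candidate and one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


def truncate_bytes(text, limit=665):
    """Cut a summary to at most `limit` bytes (UTF-8 assumed)."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")
```

For example, `rouge_l("a b c d", "a c d e")` has an LCS of three tokens ("a c d"), giving precision and recall of 0.75 each.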
33Comparing our best system to the DUC2004 systems
and humans using ROUGE-L score
34Comparing our systems to top-DUC2004
- Order I: classifier confidence score order
- Order II: SVM regression order
35Automatic Evaluation using ROUGE-L
- Order I: classifier confidence score order
- Order II: SVM regression order
39Manual Evaluation (I): MNB vs. Top-DUC2004
- Subjects: 3 native American English speakers
- They were presented with pairs of summaries
  - Top-DUC2004
  - MNB Order I
- Task: Decide which summary is more responsive in form and content to the question, or whether both are equally responsive
40Manual Evaluation (I): MNB vs. Top-DUC2004
- 85.3% (128/150) of judgments preferred one summary over the other
- 78.1% of these (100/128) preferred the summaries produced by our system
- Majority vote: in 42/50 summaries, at least two subjects made the same choice. Of these, 88.1% (37/42) preferred our summaries (p = 4.4e-7)
- Kappa statistic: 0.441
41Manual Evaluation (II): MNB Order I vs. MNB Order II
- Subjects: 3 (different) native American English speakers
- They were presented with pairs of summaries
  - MNB Order I
  - MNB Order II
- Task: Which summary has the better presentation order?
- Kappa: 0.362
- Majority vote: 61.7% (29/47) preferred summaries with Order II (not significant)
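The slides do not name the significance test behind these majority-vote results. Assuming a two-sided exact binomial (sign) test against a 50/50 chance split, the reported figures can be checked with a few lines. The function name is hypothetical.

```python
from math import comb


def sign_test_p(successes, n):
    """Two-sided exact binomial p-value under a 50/50 null hypothesis."""
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_eval_1 = sign_test_p(37, 42)  # Evaluation I: 37 of 42 majority votes
p_eval_2 = sign_test_p(29, 47)  # Evaluation II: 29 of 47 majority votes
```

Under this assumption, the value for 37/42 rounds to the reported 4.4e-7, and the value for 29/47 is well above 0.05, consistent with the "not significant" note.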
42Conclusion
- Described a system for producing biographies from training corpora collected automatically from Wikipedia and TDT4, with no manual annotations whatsoever
- Embed ACE markups and coreferences in the feature space to train our models
- Sentence ordering for biographies modeled in two ways
- Rewriting heuristic used to create the final biography
- Our system significantly outperforms all systems that participated in DUC-2004 according to ROUGE-L scores
- The sentence order of SVM regression was preferred by our human subjects, although the difference was not significant
43Future Work
- Sometimes only part of a sentence is biographical, so first simplify hypothesis sentences before classifying them
- Cluster the sentences before learning the SVM regression model
- Experiment with more linguistically informed features
- Use our approach on other query-focused summarization tasks, e.g.
  - What is X?
  - where X is an organization's name or a historical event
- Wikipedia contains organizational profiles as well as lists of events and their descriptions
44Thank You!
Special thanks to Kathy McKeown and the Speech and NLP groups at Columbia for useful discussions
Example output of our system: Who is Brian Jones?
Born in Bristol in 1947, Brian Jones, the co-pilot on the Breitling mission, learned to fly at 16, dropping out of school a year later to join the Royal Air Force. After earning his commercial balloon flying license, Jones became a ballooning instructor in 1989 and was certified as an examiner for balloon flight licenses by the British Civil Aviation Authority. He helped organize Breitling's most recent around-the-world attempts, in 1997 and 1998. Jones, who is to turn 52 next week, is actually the team's third co-pilot. After 13 years of service, he joined a catering business and, in the 1980s,
46TARGET_PER ( TIMEX TIMEX ) was one of the main leaders of ORG_Commercial → 1
47A Baptist minister, TARGET_PER became a civil rights activist early in TARGET_HIS career → 2
48SVM Regression: select important biographical features
[Formula image not preserved]
- where,
  - D is the set of 17K Wikipedia biographies
  - n(t)d is the number of occurrences of t in d
- Examples
  - Unigrams: born, became, ORG_Educational
  - Bigrams: was born, TARGET_PER died, TARGET_PER joined, TARGET_his family
49Example: Who is Brian Jones?
- Born in Bristol in 1947, Jones learned to fly at 16, dropping out of school a year later to join the Royal Air Force. (-1.6)
- After earning his commercial balloon flying license, Jones became a ballooning instructor in 1989 and was certified as an examiner for balloon flight licenses by the British Civil Aviation Authority. (4.6)
- He helped organize Breitling's most recent around-the-world attempts, in 1997 and 1998. (6.9)
- Jones, who is to turn 52 next week, is actually the team's third co-pilot. (8.2)
- After 13 years of service, he joined a catering business and, in the 1980s, set about becoming a professional balloonist. (9.3)
50Born on June 18, 1942, JOH South African Deputy
President Thabo Mbeki grew up in poverty in
Mbewuleni village in the historic land of the
Xhosa. He was in the liberation struggle
virtually from the age of 10, when he and a
cousin sold Coke bottles to raise money to pay
their African National Congress membership fees.
Thabo Mbeki, who was elected president by the
National Assembly on Monday, was inaugurated in
an auspicious ceremony with a distinctly African
flavor at the Union Buildings in Pretoria. The
father, Govan Mbeki, spent much of his life in
jail and sent his children to relatives very
young so they couldn't come to depend on parents
who
51Example 3 sentences from a Wikipedia biography
- Martin Luther King, Jr., was born on January 15,
1929, in Atlanta, Georgia. He was the son of
Reverend Martin Luther King, Sr. and Alberta
Williams King. He had an older sister, Willie
Christine (September 11, 1927) and a younger
brother, Albert Daniel.
52Example: class-based/lexical sentences
- TARGET_PER , was born on TIMEX , in GPE_Population-Center .
- TARGET_HE was the son of PER_Individual and PER_Individual .
- TARGET_HE had an older sister , PER_Individual ( TIMEX ) and a younger brother , PER_Individual .
53Example: Who is Brian Jones?
- After weeks of frustrating delays, Piccard and Jones set off from the Swiss Alps on March 1.
- Balloonists Brian Jones and Bertrand Piccard tore their billowy craft to keep it from dragging the gondola across the sand.
- "Bertrand and I were furiously trying to slash it with our knives, just to get the extra helium out," Jones said Wednesday after a ceremony to donate the torn and tattered balloon to the Smithsonian Institution's National Air and Space Museum.
- The prospect that the Breitling's crew, Bertrand Piccard and Brian Jones, may succeed where all others have failed seemed more promising with each passing hour.
- Jones, who is to turn 52 next week, is actually the team's third co-pilot.
- After 13 years of service, he joined a catering business and, in the 1980s, set about becoming a professional balloonist.
- Born in Bristol in 1947, Jones learned to fly at 16, dropping out of school a year later to join the Royal Air Force.
- Until he moved from his position as back-up pilot to pilot, he was an on-the-ground man, responsible as the team's project manager for the construction of both the balloon's gondola and its flight systems.
- Brian Jones, the co-pilot on the Breitling mission, would have remained a quiet but crucial part of his team's ballooning effort if it had not been for a series of lucky breaks.
- After earning his commercial balloon flying license, Jones became a ballooning instructor in 1989 and was certified as an examiner for balloon flight licenses by the British Civil Aviation Authority.
- He helped organize Breitling's most recent around-the-world attempts, in 1997 and 1998.
- .
54Example: Who is Brian Jones?
- Jones, who is to turn 52 next week, is actually the team's third co-pilot.
- After weeks of frustrating delays, Piccard and Jones set off from the Swiss Alps on March 1.
- After 13 years of service, he joined a catering business and, in the 1980s, set about becoming a professional balloonist.
- "Bertrand and I were furiously trying to slash it with our knives, just to get the extra helium out," Jones said Wednesday after a ceremony to donate the torn and tattered balloon to the Smithsonian Institution's National Air and Space Museum.
- Born in Bristol in 1947, Jones learned to fly at 16, dropping out of school a year later to join the Royal Air Force.
- Brian Jones, the co-pilot on the Breitling mission, would have remained a quiet but crucial part of his team's ballooning effort if it had not been for a series of lucky breaks.
- Until he moved from his position as back-up pilot to pilot, he was an on-the-ground man, responsible as the team's project manager for the construction of both the balloon's gondola and its flight systems.
- Balloonists Brian Jones and Bertrand Piccard tore their billowy craft to keep it from dragging the gondola across the sand.
- After earning his commercial balloon flying license, Jones became a ballooning instructor in 1989 and was certified as an examiner for balloon flight licenses by the British Civil Aviation Authority.
- The prospect that the Breitling's crew, Bertrand Piccard and Brian Jones, may succeed where all others have failed seemed more promising with each passing hour.
- He helped organize Breitling's most recent around-the-world attempts, in 1997 and 1998.
- .
55Related Work
- Blair-Goldensohn et al. (2004a)'s DefScriber treats "Who is X?" as a definition question and targets definitional themes (e.g., genus, species) found in the input document collections that include references to the target person. Therefore, biographical facts (such as birth and death) are not considered.
- Duboue et al. (2003) also address the problem of learning content selection rules for biographies. These rules are learned from two corpora: the first is a semi-structured corpus with lists of biographical facts about show business celebrities, and the second contains free-text biographies about the same celebrities.
- Weischedel et al. (2004) focus on modeling kernel-fact features typical for biographies using linguistic and semantic processing. Linguistic features are derived from predicate-argument structures deduced from parse trees, and semantic features are the set of biography-related relations and events defined in the ACE guidelines. Sentences containing these kernel facts are ranked according to the probability of their relevance to biographies. A corpus of manually created biographies, including Wikipedia, is used to estimate the conditional distribution of relevant material given a kernel fact, and a background corpus is used to learn the feature distribution.
- Barzilay and Lee (2004) used a novel adaptation of algorithms for Hidden Markov Models to learn how to structure summaries in a specific domain by analyzing topic shifts within a corpus in that domain.