1
Empirical Methods in Information Extraction
  • Claire Cardie
  • 1999. 11. 2.

2
Contents
  • Introduction
  • The Architecture of an Information Extraction
    System
  • The Role of Corpus-Based Language Learning
    Algorithms
  • Learning Extraction Patterns
  • Coreference Resolution and Template Generation
  • Future Directions

3
Introduction (1/2)
  • Information Extraction System
  • inherently domain specific
  • takes as input an unrestricted text and
    summarizes the text with respect to a
    prespecified topic or domain of interest. (Figure
    1)
  • skims a text to find relevant sections and then
    focuses only on those sections.
  • MUC performance evaluation
  • recall
  • precision
  • applications
  • analyzing
  • terrorist activities, business joint ventures,
    medical patient records,
  • building
  • KB from web pages, job listing DB from
    newsgroups / web sites / advertisements, weather
    forecast DB from web pages, ...

recall = (# correct slot-fillers in output template) /
         (# slot-fillers in answer key)
precision = (# correct slot-fillers in output template) /
            (# slot-fillers in output template)
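A minimal sketch of how these two scores can be computed, assuming a template is represented as a set of (slot, filler) pairs; the output template and answer key below are invented for illustration.

```python
# Minimal sketch (not from the paper): scoring one output template against its
# answer key, with each template represented as a set of (slot, filler) pairs.
def score(output_template, answer_key):
    correct = output_template & answer_key           # slot-fillers matching the key
    recall = len(correct) / len(answer_key)          # correct / slot-fillers in answer key
    precision = len(correct) / len(output_template)  # correct / slot-fillers in output template
    return recall, precision

# Hypothetical terrorism-domain example.
output = {("perpetrator", "FMLN"), ("victim", "peasants"), ("weapon", "bomb")}
key    = {("perpetrator", "FMLN"), ("victim", "three peasants"), ("instrument", "bomb")}
print(score(output, key))  # (0.33..., 0.33...)
```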
4
Introduction (2/2)
  • Problems in today's IE systems
  • accuracy
  • the errors of an automated IE system are
  • due to its relatively shallow understanding of the
    input text
  • difficult to track down and to correct
  • portability
  • domain-specific nature
  • manually modifying and adding domain-specific
    linguistic knowledge to an existing NLP system is
    slow and error-prone.

We will see that empirical methods for IE are
corpus-based machine learning algorithms.
5
The Architecture of an IE System (1/2)
  • Approaches to IE in the early days
  • traditional NLP techniques vs. keyword matching
    techniques
  • Standard architecture for IE systems (Figure 2)
  • tokenization and tagging
  • tag each word with respect to POS and possibly
    semantic class
  • sentence analysis
  • one or more stages of syntactic analysis
  • identify
  • noun/verb groups, prepositional phrases,
    subjects, objects, conjunctions,
  • semantic entities relevant to the extraction
    topic
  • the system need only perform partial parsing
  • looks for fragments of text that can be reliably
    recognized
  • the ambiguity resolution decisions can be
    postponed

6
The Architecture of an IE System (2/2)
  • Standard architecture for IE systems (continued)
  • extraction
  • the first entirely domain-specific component
  • identifies domain-specific relations among
    relevant entities in the text
  • merging
  • coreference resolution, or anaphora resolution
  • determines whether each noun phrase refers to an
    already-mentioned entity or introduces a new one
  • determine the implicit subjects of all verb
    phrases
  • discourse-level inference
  • template generation
  • determines the number of distinct events in the
    text
  • maps the individually extracted pieces of
    information onto each event
  • produces output templates
  • the best place to apply domain-specific
    constraints
  • some slots require set fills, or require
    normalization of their fillers
    (a minimal pipeline sketch follows this slide)
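A minimal end-to-end sketch of the standard pipeline just described, under heavy simplifying assumptions: every function below is a hypothetical placeholder whose toy logic only illustrates how each stage consumes the previous stage's output.

```python
# Minimal sketch of the standard IE pipeline (assumptions only; the function
# names and their toy logic are placeholders, not any published system).

def tokenize_and_tag(text):
    """Split into words and attach a (trivial) part-of-speech tag to each."""
    return [(tok, "NOUN") for tok in text.split()]

def partial_parse(tagged):
    """Group tagged words into fragments that can be reliably recognized."""
    return [{"type": "noun_group", "tokens": tagged}]

def extract(fragments):
    """Apply domain-specific extraction patterns to the parsed fragments."""
    return [{"slot": "victim", "filler": frag["tokens"][0][0]} for frag in fragments]

def merge(extractions):
    """Coreference resolution: collapse extractions referring to the same entity."""
    unique = {(e["slot"], e["filler"]): e for e in extractions}
    return list(unique.values())

def generate_templates(entities):
    """Decide how many events the text describes and fill one template per event."""
    return [{"event": 1, "slots": entities}]

text = "peasants were killed in the attack"
print(generate_templates(merge(extract(partial_parse(tokenize_and_tag(text))))))
```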

7
The Role of Corpus-Based Language Learning
Algorithms (1/3)
  • Q: How have researchers used empirical methods in
    NLP to improve the accuracy and portability of IE
    systems?
  • A: Corpus-based language learning algorithms have
    been used to improve individual components of the
    IE system.
  • For language tasks that are domain-independent
    and syntactic
  • annotated corpora already exist
  • POS tagging, partial parsing, WSD
  • the importance of WSD for the IE task remains
    unclear.
  • NL learning techniques are more difficult to
    apply to subsequent stages of IE.
  • learning extraction patterns, coreference
    resolution, template generation

8
The Role of Corpus-Based Language Learning
Algorithms (2/3)
  • The problems of applying empirical methods
  • no corpora annotated with the appropriate
    semantic domain-specific supervisory
    information
  • corpus for IE: <text, output template> pairs
  • the output templates
  • say nothing about which occurrence of the string
    is responsible for the extraction
  • provide no direct means for learning patterns to
    extract symbols not necessarily appearing
    anywhere in the text (set fills)
  • the semantic, domain-specific language-processing
    skills require the output of earlier levels of
    analysis (tagging, partial parsing).
  • this complicates the generation of training examples
  • whenever the behavior of these earlier modules
    changes,
  • new training examples must be generated
  • the learning algorithms for later stages must be
    retrained
  • learning algorithms must deal with noise caused
    by errors from earlier components → new
    algorithms need to be developed

9
The Role of Corpus-Based Language Learning
Algorithms (3/3)
  • Data-driven nature of corpus-based approaches
  • accuracy
  • when the training data is derived from the same
    type of texts that the IE system is to process,
  • the acquired language skills are automatically
    tuned to that corpus, increasing the accuracy of
    the system.
  • portability
  • because each NLU skill is learned automatically
    rather than being manually coded,
  • that skill can be moved quickly from one IE
    system to another by retraining the appropriate
    component.

10
Learning Extraction Patterns (1/5)
  • The role for empirical methods in the Extraction
    phase
  • knowledge acquisition to automate the
    acquisition of good extraction patterns
  • AutoSlog [Riloff 1993]
  • learns extraction patterns in the form of
    domain-specific concept node definitions for use
    with the CIRCUS parser. (Figure 3)
  • learns concept node definitions via a one-shot
    learning algorithm
  • background knowledge
  • a small set of general linguistic patterns
    (approximately 13)
  • requires a human feedback loop, which filters out
    bad extraction patterns
  • accuracy: 98%, portability: 5 hours
  • a critical step toward building IE systems that
    are trainable entirely by end-users
  • (Figure 4)

11
Learning Extraction Patterns (2/5)
AutoSlog's Learning Algorithm
Given a noun phrase to be extracted:
1. Find the sentence from which the noun phrase originated.
2. Present the sentence to the partial parser for processing.
3. Apply the linguistic patterns in order.
4. When a pattern applies, generate a concept node definition from the
   matched constituents, their context, the concept type provided in the
   annotation for the target noun phrase, and the predefined semantic class
   for the filler.

<active-voice-verb> followed by <target-np> <direct object>

Concept:             <<concept> of <target-np>>
Trigger:             <<verb> of <active-voice-verb>>
Position:            direct-object
Constraints:         ((<<semantic class> of <concept>>))
Enabling Conditions: ((active-voice))
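A minimal sketch of this one-shot generation step, under stated assumptions: the parse representation, the single linguistic pattern, and the annotated example are all invented simplifications rather than Riloff's CIRCUS representations.

```python
# Minimal sketch of AutoSlog-style one-shot pattern generation (a simplification,
# not Riloff's implementation). A "parse" is just a dict naming the verb, its
# voice, and its direct object; real CIRCUS output is far richer.

LINGUISTIC_PATTERNS = [
    # (name, test, concept-node builder) -- applied in order, first match wins
    ("active-verb-dobj",
     lambda parse, np: parse.get("voice") == "active" and parse.get("dobj") == np,
     lambda parse, np, concept, sem_class: {
         "Concept": concept,
         "Trigger": parse["verb"],
         "Position": "direct-object",
         "Constraints": (sem_class,),
         "Enabling Conditions": ("active-voice",),
     }),
]

def autoslog(parse, target_np, concept, sem_class):
    """Generate one concept-node definition from a single annotated example."""
    for name, test, build in LINGUISTIC_PATTERNS:
        if test(parse, target_np):
            return build(parse, target_np, concept, sem_class)
    return None  # no pattern applied; a human would inspect such cases

# Hypothetical annotated example: "The terrorists bombed the embassy."
parse = {"voice": "active", "verb": "bombed", "subject": "the terrorists",
         "dobj": "the embassy"}
print(autoslog(parse, "the embassy", "target-of-attack", "physical-object"))
```

The human feedback loop mentioned on the previous slide would then review definitions like the one printed here and discard any that look spurious.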
12
Learning Extraction Patterns (3/5)
  • PALKA [Kim & Moldovan 1995]
  • background knowledge
  • concept hierarchy
  • a set of keywords that can be used to trigger
    each pattern
  • comprises a set of generic semantic case frame
    definitions for each type of information to be
    extracted
  • semantic class lexicon
  • CRYSTAL [Soderland 1995]
  • triggers comprise a much more detailed
    specification of linguistic context
  • employs a covering algorithm
  • medical diagnosis domain
  • precision 50-80%, recall 45-75%

13
Learning Extraction Patterns (4/5)
CRYSTAL's Learning Algorithm
1. Begin by generating the most specific concept node possible for every
   phrase to be extracted in the training texts.
2. For each concept node C:
   2.1. Find the most similar concept node C'.
   2.2. Relax the constraints of each just enough to unify C and C'.
   2.3. Test the new extraction pattern P against the training corpus.
        If (error rate < threshold) then add P, replacing C and C';
        else stop.
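A minimal sketch of one specific-to-general unification step in this style, under heavy assumptions: concept nodes are reduced to plain sets of constraints, and the similarity measure, error estimate, and training data are invented.

```python
# Minimal sketch of a CRYSTAL-style covering step (a simplification, not
# Soderland's implementation). A concept node is just a frozenset of
# constraints; relaxing two nodes keeps only the constraints they share.

def unify(c1, c2):
    """Relax two concept nodes just enough to cover both."""
    return c1 & c2

def error_rate(pattern, negatives):
    """Fraction of negative training phrases the relaxed pattern would extract."""
    covered = [n for n in negatives if pattern <= n]
    return len(covered) / max(len(negatives), 1)

def crystal_step(nodes, negatives, threshold=0.2):
    """Try to merge each concept node with its most similar sibling."""
    nodes = list(nodes)
    for i, c in enumerate(nodes):
        others = nodes[:i] + nodes[i + 1:]
        if not others:
            break
        c_prime = max(others, key=lambda o: len(c & o))   # most similar node
        p = unify(c, c_prime)
        if error_rate(p, negatives) < threshold:
            return [n for n in nodes if n not in (c, c_prime)] + [p], True
    return nodes, False

# Hypothetical training data: constraint sets for phrases to extract (positives)
# and phrases that must not be extracted (negatives).
positives = [frozenset({"subject", "human", "patient"}),
             frozenset({"subject", "human", "doctor"})]
negatives = [frozenset({"object", "machine"})]
print(crystal_step(positives, negatives))
```

Repeating this step until no merge passes the error-rate test gives the covering behavior described above.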
14
Learning Extraction Patterns (5/5)
  • Comparison
  • AutoSlog
  • general to specific
  • human feedback
  • PALKA
  • generalization and specialization
  • automated feedback
  • requires more background knowledge
  • CRYSTAL
  • specific to general (covering algorithm)
  • automated feedback
  • requires more background knowledge
  • Research issues
  • handling set fills
  • type of the extracted information
  • evaluation
  • determining which method for learning extraction
    patterns will give the best results in a new
    extraction domain

15
Coreference Resolution and Template
Generation (1/3)
  • Discourse processing is a major weakness of
    existing IE systems
  • generating good heuristics is challenging
  • assume as input fully parsed sentences
  • must take into account the accumulated errors
  • must be able to handle the myriad forms of
    coreference across different domains
  • Coreference problem as a classification task
    (Figure 5)
  • given two phrases and the context in which they
    occur,
  • classify the phrases with respect to whether or
    not they refer to the same object
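A minimal sketch of this classification formulation, with invented features and phrase pairs; scikit-learn's DecisionTreeClassifier stands in for the C4.5 system used by the approaches on the next slide.

```python
# Minimal sketch of coreference-as-classification (assumptions only: the four
# features and the training pairs are invented; DecisionTreeClassifier is a
# stand-in for C4.5).
from sklearn.tree import DecisionTreeClassifier

# Each instance describes a pair of phrases:
# [same head noun?, distance in sentences, number agreement?, both proper names?]
X = [
    [1, 0, 1, 1],   # "the company" ... "the company"    -> coreferent
    [0, 2, 1, 0],   # "a venture"   ... "it"             -> coreferent
    [0, 1, 0, 0],   # "the banks"   ... "the agreement"  -> not coreferent
    [0, 3, 1, 1],   # "Toyota"      ... "Nissan"         -> not coreferent
]
y = [1, 1, 0, 0]    # 1 = same entity, 0 = different entities

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1, 1, 1, 1]]))   # classify a new phrase pair
```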

16
Coreference Resolution and Template
Generation (2/3)
  • MLR [Aone & Bennett 1995]
  • use C4.5 decision tree induction system
  • tested on the Japanese corpus for the business
    joint ventures
  • use automatically generated data set
  • 66 domain-independent features
  • evaluated using data sets derived from 250 texts
  • recall 67-70%, precision 83-88%
  • RESOLVE [McCarthy & Lehnert 1995]
  • use C4.5 decision tree induction system
  • tested on the English corpus for the business
    joint ventures (MUC-5)
  • use manually generated, noise-free data set
  • include domain-specific features
  • evaluated using data sets derived from 50 texts
  • recall 80-85%, precision 87-92%

17
Coreference Resolution and Template
Generation (3/3)
  • The results for coreference resolution are
    promising
  • possible to develop automatically trainable
    coreference systems that can compete favorably
    with manually designed systems
  • specially designed learning algorithms need not
    be developed
  • symbolic ML techniques offer a mechanism for
    evaluating the usefulness of different knowledge
    sources
  • Still, much research remains to be done
  • additional types of anaphors using a variety of
    feature sets
  • the role of domain-specific information for
    coreference resolution
  • the relative effect of errors from the preceding
    phases of text analysis
  • Trainable systems that tackle Merging Template
    Generation
  • TTG [Dolan 1991], Wrap-Up [Soderland & Lehnert
    1994]
  • generate a series of decision trees

18
Future Directions
  • Unsupervised learning algorithms
  • a means for sidestepping the lack of large,
    annotated corpora
  • Techniques that allow end-users to quickly train
    IE systems
  • through interaction with the system over time
  • without intervention by NLP system developers

19
IE System in the Domain of Natural Disasters
20
Architecture for an IE System
21
Concept Node for Extracting Damage Information
Concept Node Definition: a domain-specific semantic case frame (one slot per frame)
  Concept: the type of concept to be recognized
  Trigger: the word that activates the pattern
  Position: the syntactic position where the concept is expected to be found
  Constraints: selectional restrictions that apply to any potential instance
    of the concept
  Enabling Conditions: constraints on the linguistic context of the triggering
    word that must be satisfied before the pattern is activated
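A minimal sketch of such a case frame as a data structure, assuming a plain Python dataclass rather than CIRCUS's actual representation; the damage node at the bottom is an invented example.

```python
# Minimal sketch (an assumption, not CIRCUS's data structure): a concept node
# as a small dataclass mirroring the five fields listed above.
from dataclasses import dataclass

@dataclass
class ConceptNode:
    concept: str                     # type of concept to be recognized
    trigger: str                     # word that activates the pattern
    position: str                    # syntactic position where the filler is expected
    constraints: tuple = ()          # selectional restrictions on potential fillers
    enabling_conditions: tuple = ()  # linguistic context required of the trigger

# Hypothetical node for damage information in the natural-disasters domain:
# matches "<something> damaged <physical object>" in the active voice.
damage = ConceptNode(concept="damage", trigger="damaged", position="direct-object",
                     constraints=("physical-object",),
                     enabling_conditions=("active-voice",))
print(damage)
```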
22
Learning Information Extraction Patterns
23
A Machine Learning Approach to Coreference
Resolution