1
Empirical Methods in Information Extraction
  • Claire Cardie
  • 1999. 11. 2.

2
Contents
  • Introduction
  • The Architecture of an Information Extraction
    System
  • The Role of Corpus-Based Language Learning
    Algorithms
  • Learning Extraction Patterns
  • Coreference Resolution and Template Generation
  • Future Directions

3
Introduction (1/2)
  • Information Extraction System
  • inherently domain specific
  • takes as input an unrestricted text and
    summarizes the text with respect to a
    prespecified topic or domain of interest. (Figure
    1)
  • skims a text to find relevant sections and then
    focuses only on those sections.
  • MUC performance evaluation
  • recall
  • precision
  • applications
  • analyzing
  • terrorist activities, business joint ventures,
    medical patient records,
  • building
  • KB from web pages, job listing DB from
    newsgroups / web sites / advertisements, weather
    forecast DB from web pages, ...

recall = (# correct slot-fillers in output template) /
         (# slot-fillers in answer key)
precision = (# correct slot-fillers in output template) /
            (# slot-fillers in output template)
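A minimal sketch of how these two scores can be computed, assuming a template is represented as a set of (slot, filler) pairs; the output template and answer key below are invented for illustration.

```python
# Minimal sketch (not from the paper): scoring one output template against its
# answer key, with each template represented as a set of (slot, filler) pairs.
def score(output_template, answer_key):
    correct = output_template & answer_key           # slot-fillers matching the key
    recall = len(correct) / len(answer_key)          # correct / slot-fillers in answer key
    precision = len(correct) / len(output_template)  # correct / slot-fillers in output template
    return recall, precision

# Hypothetical terrorism-domain example.
output = {("perpetrator", "FMLN"), ("victim", "peasants"), ("weapon", "bomb")}
key    = {("perpetrator", "FMLN"), ("victim", "three peasants"), ("instrument", "bomb")}
print(score(output, key))  # (0.33..., 0.33...)
```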
4
Introduction (2/2)
  • Problems in today's IE systems
  • accuracy
  • the errors of an automated IE system are
  • due to its relatively shallow understanding of the
    input text
  • difficult to track down and to correct
  • portability
  • domain-specific nature
  • manually modifying and adding domain-specific
    linguistic knowledge to an existing NLP system is
    slow and error-prone.

We will see that empirical methods for IE are
corpus-based machine learning algorithms.
5
The Architecture of an IE System (1/2)
  • Approaches to IE in the early days
  • traditional NLP techniques vs. keyword matching
    techniques
  • Standard architecture for IE systems (Figure 2)
  • tokenization and tagging
  • tag each word with respect to POS and possibly
    semantic class
  • sentence analysis
  • one or more stages of syntactic analysis
  • identify
  • noun/verb groups, prepositional phrases,
    subjects, objects, conjunctions,
  • semantic entities relevant to the extraction
    topic
  • the system need only perform partial parsing
  • looks for fragments of text that can be reliably
    recognized
  • the ambiguity resolution decisions can be
    postponed

6
The Architecture of an IE System (2/2)
  • Standard architecture for IE systems (continued)
  • extraction
  • the first entirely domain-specific component
  • identifies domain-specific relations among
    relevant entities in the text
  • merging
  • coreference resolution, or anaphora resolution
  • determines whether each noun phrase refers to an
    already-mentioned entity or introduces a new one
  • determine the implicit subjects of all verb
    phrases
  • discourse-level inference
  • template generation
  • determines the number of distinct events in the
    text
  • maps the individually extracted pieces of
    information onto each event
  • produces output templates
  • the best place to apply domain-specific
    constraints
  • some slots require set fills, or require
    normalization of their fillers
    (a minimal pipeline sketch follows this slide)
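A minimal end-to-end sketch of the standard pipeline just described, under heavy simplifying assumptions: every function below is a hypothetical placeholder whose toy logic only illustrates how each stage consumes the previous stage's output.

```python
# Minimal sketch of the standard IE pipeline (assumptions only; the function
# names and their toy logic are placeholders, not any published system).

def tokenize_and_tag(text):
    """Split into words and attach a (trivial) part-of-speech tag to each."""
    return [(tok, "NOUN") for tok in text.split()]

def partial_parse(tagged):
    """Group tagged words into fragments that can be reliably recognized."""
    return [{"type": "noun_group", "tokens": tagged}]

def extract(fragments):
    """Apply domain-specific extraction patterns to the parsed fragments."""
    return [{"slot": "victim", "filler": frag["tokens"][0][0]} for frag in fragments]

def merge(extractions):
    """Coreference resolution: collapse extractions referring to the same entity."""
    unique = {(e["slot"], e["filler"]): e for e in extractions}
    return list(unique.values())

def generate_templates(entities):
    """Decide how many events the text describes and fill one template per event."""
    return [{"event": 1, "slots": entities}]

text = "peasants were killed in the attack"
print(generate_templates(merge(extract(partial_parse(tokenize_and_tag(text))))))
```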

7
The Role of Corpus-Based Language Learning
Algorithms (1/3)
  • Q: How have researchers used empirical methods in
    NLP to improve the accuracy and portability of IE
    systems?
  • A: Corpus-based language learning algorithms have
    been used to improve individual components of the
    IE system.
  • For language tasks that are domain-independent
    and syntactic
  • annotated corpora already exist
  • POS tagging, partial parsing, WSD
  • the importance of WSD for the IE task remains
    unclear.
  • NL learning techniques are more difficult to
    apply to subsequent stages of IE.
  • learning extraction patterns, coreference
    resolution, template generation

8
The Role of Corpus-Based Language Learning
Algorithms (2/3)
  • The problems of applying empirical methods
  • no corpora annotated with the appropriate
    semantic domain-specific supervisory
    information
  • corpus for IE: <text, output template> pairs
  • the output templates
  • say nothing about which occurrence of the string
    is responsible for the extraction
  • provide no direct means for learning patterns to
    extract symbols not necessarily appearing
    anywhere in the text (set fills)
  • the semantic, domain-specific language-processing
    skills require the output of earlier levels of
    analysis (tagging, partial parsing).
  • this complicates the generation of training examples
  • whenever the behavior of these earlier modules
    changes,
  • new training examples must be generated
  • the learning algorithms for later stages must be
    retrained
  • learning algorithms must deal with noise caused
    by errors from earlier components → new
    algorithms need to be developed

9
The Role of Corpus-Based Language Learning
Algorithms (3/3)
  • Data-driven nature of corpus-based approaches
  • accuracy
  • when the training data is derived from the same
    type of texts that the IE system is to process,
  • the acquired language skills are automatically
    tuned to that corpus, increasing the accuracy of
    the system.
  • portability
  • because each NLU skill is learned automatically
    rather than being manually coded,
  • that skill can be moved quickly from one IE
    system to another by retraining the appropriate
    component.

10
Learning Extraction Patterns (1/5)
  • The role for empirical methods in the Extraction
    phase
  • knowledge acquisition to automate the
    acquisition of good extraction patterns
  • AutoSlog [Riloff 1993]
  • learns extraction patterns in the form of
    domain-specific concept node definitions for use
    with the CIRCUS parser. (Figure 3)
  • learns concept node definitions via a one-shot
    learning algorithm
  • background knowledge
  • a small set of general linguistic patterns
    (approximately 13)
  • requires a human feedback loop, which filters out
    bad extraction patterns
  • accuracy: 98%, portability: 5 hours
  • a critical step toward building IE systems that
    are trainable entirely by end-users
  • (Figure 4)

11
Learning Extraction Patterns (2/5)
AutoSlog's Learning Algorithm
Given a noun phrase to be extracted:
1. Find the sentence from which the noun phrase originated.
2. Present the sentence to the partial parser for processing.
3. Apply the linguistic patterns in order.
4. When a pattern applies, generate a concept node definition from the
   matched constituents, their context, the concept type provided in the
   annotation for the target noun phrase, and the predefined semantic class
   for the filler.

<active-voice-verb> followed by <target-np> <direct object>

Concept:             <<concept> of <target-np>>
Trigger:             <<verb> of <active-voice-verb>>
Position:            direct-object
Constraints:         ((<<semantic class> of <concept>>))
Enabling Conditions: ((active-voice))
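A minimal sketch of this one-shot generation step, under stated assumptions: the parse representation, the single linguistic pattern, and the annotated example are all invented simplifications rather than Riloff's CIRCUS representations.

```python
# Minimal sketch of AutoSlog-style one-shot pattern generation (a simplification,
# not Riloff's implementation). A "parse" is just a dict naming the verb, its
# voice, and its direct object; real CIRCUS output is far richer.

LINGUISTIC_PATTERNS = [
    # (name, test, concept-node builder) -- applied in order, first match wins
    ("active-verb-dobj",
     lambda parse, np: parse.get("voice") == "active" and parse.get("dobj") == np,
     lambda parse, np, concept, sem_class: {
         "Concept": concept,
         "Trigger": parse["verb"],
         "Position": "direct-object",
         "Constraints": (sem_class,),
         "Enabling Conditions": ("active-voice",),
     }),
]

def autoslog(parse, target_np, concept, sem_class):
    """Generate one concept-node definition from a single annotated example."""
    for name, test, build in LINGUISTIC_PATTERNS:
        if test(parse, target_np):
            return build(parse, target_np, concept, sem_class)
    return None  # no pattern applied; a human would inspect such cases

# Hypothetical annotated example: "The terrorists bombed the embassy."
parse = {"voice": "active", "verb": "bombed", "subject": "the terrorists",
         "dobj": "the embassy"}
print(autoslog(parse, "the embassy", "target-of-attack", "physical-object"))
```

The human feedback loop mentioned on the previous slide would then review definitions like the one printed here and discard any that look spurious.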
12
Learning Extraction Patterns (3/5)
  • PALKA [Kim & Moldovan 1995]
  • background knowledge
  • concept hierarchy
  • a set of keywords that can be used to trigger
    each pattern
  • comprises a set of generic semantic case frame
    definitions for each type of information to be
    extracted
  • semantic class lexicon
  • CRYSTAL [Soderland 1995]
  • triggers comprise a much more detailed
    specification of linguistic context
  • employs a covering algorithm
  • medical diagnosis domain
  • precision 50-80%, recall 45-75%

13
Learning Extraction Patterns (4/5)
CRYSTAL's Learning Algorithm
1. Begin by generating the most specific concept node possible for every
   phrase to be extracted in the training texts.
2. For each concept node C:
   2.1. Find the most similar concept node C'.
   2.2. Relax the constraints of each just enough to unify C and C'.
   2.3. Test the new extraction pattern P against the training corpus.
        If (error rate < threshold) then add P, replacing C and C';
        else stop.
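A minimal sketch of one specific-to-general unification step in this style, under heavy assumptions: concept nodes are reduced to plain sets of constraints, and the similarity measure, error estimate, and training data are invented.

```python
# Minimal sketch of a CRYSTAL-style covering step (a simplification, not
# Soderland's implementation). A concept node is just a frozenset of
# constraints; relaxing two nodes keeps only the constraints they share.

def unify(c1, c2):
    """Relax two concept nodes just enough to cover both."""
    return c1 & c2

def error_rate(pattern, negatives):
    """Fraction of negative training phrases the relaxed pattern would extract."""
    covered = [n for n in negatives if pattern <= n]
    return len(covered) / max(len(negatives), 1)

def crystal_step(nodes, negatives, threshold=0.2):
    """Try to merge each concept node with its most similar sibling."""
    nodes = list(nodes)
    for i, c in enumerate(nodes):
        others = nodes[:i] + nodes[i + 1:]
        if not others:
            break
        c_prime = max(others, key=lambda o: len(c & o))   # most similar node
        p = unify(c, c_prime)
        if error_rate(p, negatives) < threshold:
            return [n for n in nodes if n not in (c, c_prime)] + [p], True
    return nodes, False

# Hypothetical training data: constraint sets for phrases to extract (positives)
# and phrases that must not be extracted (negatives).
positives = [frozenset({"subject", "human", "patient"}),
             frozenset({"subject", "human", "doctor"})]
negatives = [frozenset({"object", "machine"})]
print(crystal_step(positives, negatives))
```

Repeating this step until no merge passes the error-rate test gives the covering behavior described above.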
14
Learning Extraction Patterns (5/5)
  • Comparison
  • AutoSlog
  • general to specific
  • human feedback
  • PALKA
  • generalization and specialization
  • automated feedback
  • requires more background knowledge
  • CRYSTAL
  • specific to general (covering algorithm)
  • automated feedback
  • requires more background knowledge
  • Research issues
  • handling set fills
  • type of the extracted information
  • evaluation
  • determining which method for learning extraction
    patterns will give the best results in a new
    extraction domain

15
Coreference Resolution and Template
Generation (1/3)
  • Discourse processing is a major weakness of
    existing IE systems
  • generating good heuristics is challenging
  • assume as input fully parsed sentences
  • must take into account the accumulated errors
  • must be able to handle the myriad forms of
    coreference across different domains
  • Coreference problem as a classification task
    (Figure 5)
  • given two phrases and the context in which they
    occur,
  • classify the phrases with respect to whether or
    not they refer to the same object
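A minimal sketch of this classification formulation, with invented features and phrase pairs; scikit-learn's DecisionTreeClassifier stands in for the C4.5 system used by the approaches on the next slide.

```python
# Minimal sketch of coreference-as-classification (assumptions only: the four
# features and the training pairs are invented; DecisionTreeClassifier is a
# stand-in for C4.5).
from sklearn.tree import DecisionTreeClassifier

# Each instance describes a pair of phrases:
# [same head noun?, distance in sentences, number agreement?, both proper names?]
X = [
    [1, 0, 1, 1],   # "the company" ... "the company"    -> coreferent
    [0, 2, 1, 0],   # "a venture"   ... "it"             -> coreferent
    [0, 1, 0, 0],   # "the banks"   ... "the agreement"  -> not coreferent
    [0, 3, 1, 1],   # "Toyota"      ... "Nissan"         -> not coreferent
]
y = [1, 1, 0, 0]    # 1 = same entity, 0 = different entities

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1, 1, 1, 1]]))   # classify a new phrase pair
```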

16
Coreference Resolution and Template
Generation (2/3)
  • MLR [Aone & Bennett 1995]
  • use C4.5 decision tree induction system
  • tested on the Japanese corpus for the business
    joint ventures
  • use automatically generated data set
  • 66 domain-independent features
  • evaluated using data sets derived from 250 texts
  • recall 67-70%, precision 83-88%
  • RESOLVE [McCarthy & Lehnert 1995]
  • use C4.5 decision tree induction system
  • tested on the English corpus for the business
    joint ventures (MUC-5)
  • use manually generated, noise-free data set
  • include domain-specific features
  • evaluated using data sets derived from 50 texts
  • recall 80-85%, precision 87-92%

17
Coreference Resolution and Template
Generation (3/3)
  • The results for coreference resolution are
    promising
  • possible to develop automatically trainable
    coreference systems that can compete favorably
    with manually designed systems
  • specially designed learning algorithms need not
    be developed
  • symbolic ML techniques offer a mechanism for
    evaluating the usefulness of different knowledge
    sources
  • Still, much research remains to be done
  • additional types of anaphors using a variety of
    feature sets
  • the role of domain-specific information for
    coreference resolution
  • the relative effect of errors from the preceding
    phases of text analysis
  • Trainable systems that tackle Merging Template
    Generation
  • TTG [Dolan 1991], Wrap-Up [Soderland & Lehnert
    1994]
  • generate a series of decision trees

18
Future Directions
  • Unsupervised learning algorithms
  • a means for sidestepping the lack of large,
    annotated corpora
  • Techniques that allow end-users to quickly train
    IE systems
  • through interaction with the system over time
  • without intervention by NLP system developers

19
IE System in the Domain of Natural Disasters
20
Architecture for an IE System
21
Concept Node for Extracting Damage Information
Concept Node Definition: a domain-specific semantic case frame (one slot per frame)
  Concept: the type of concept to be recognized
  Trigger: the word that activates the pattern
  Position: the syntactic position where the concept is expected to be found
  Constraints: selectional restrictions that apply to any potential instance
    of the concept
  Enabling Conditions: constraints on the linguistic context of the triggering
    word that must be satisfied before the pattern is activated
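A minimal sketch of such a case frame as a data structure, assuming a plain Python dataclass rather than CIRCUS's actual representation; the damage node at the bottom is an invented example.

```python
# Minimal sketch (an assumption, not CIRCUS's data structure): a concept node
# as a small dataclass mirroring the five fields listed above.
from dataclasses import dataclass

@dataclass
class ConceptNode:
    concept: str                     # type of concept to be recognized
    trigger: str                     # word that activates the pattern
    position: str                    # syntactic position where the filler is expected
    constraints: tuple = ()          # selectional restrictions on potential fillers
    enabling_conditions: tuple = ()  # linguistic context required of the trigger

# Hypothetical node for damage information in the natural-disasters domain:
# matches "<something> damaged <physical object>" in the active voice.
damage = ConceptNode(concept="damage", trigger="damaged", position="direct-object",
                     constraints=("physical-object",),
                     enabling_conditions=("active-voice",))
print(damage)
```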
22
Learning Information Extraction Patterns
23
A Machine Learning Approach to Coreference
Resolution