Title: Automatic Acquisition of Lexical Classes and Extraction Patterns for Information Extraction
1 Automatic Acquisition of Lexical Classes and Extraction Patterns for Information Extraction
- Kiyoshi Sudo
- Ph.D. Research Proposal
- New York University
- Committee: Ralph Grishman, Satoshi Sekine, I. Dan Melamed
2 Outline
- Introduction
- Research Proposal
- Problem Setting
- Approach
- Application to Information Extraction
- Discussion
3 MUC Scenario Template Task
- MURREE, Pakistan (AP) -- Masked gunmen firing
Kalashnikov rifles burst through the front gates
of a Christian school Monday, killing six people
and wounding three in the latest attack against
Western interests since Pakistan joined the war
against terrorism.
5 High Cost for Acquiring Knowledge-Base
- Find extraction patterns
- Find relevant documents
- Find relevant events
- Analyze sentences
- Find domain-specific lexicon
- Find existing KB (e.g. thesaurus, gazetteers)
6 Prior Work
- Automatic Knowledge Acquisition: Lexical Acquisition, Pattern Acquisition
- Mutual Bootstrapping (Riloff and Jones 1999)
- Pattern Discovery with Document Re-ranking (Yangarber et al. 2000)
- Simultaneous Multi-Semantic Class (Thelen and Riloff 2002; Yangarber et al. 2002)
- Pattern Acquisition for QA (Ravichandran and Hovy 2002)
7 Challenge
[Diagram: User; Seed Lexicon, Seed Pattern; Expanded Lexicon, Expanded Pattern Set; Knowledge Base]
8 Meeting the Challenge
[Diagram: User; Seed Lexicon, Seed Pattern; Expanded Lexicon, Expanded Pattern Set; Knowledge Base]
9 Semantic Clustering
- Description specific enough to define the scenario
  - (terrorism, bombing, kidnapping)
  - "Tell me about the terrorism action, such as bombing and kidnapping."
- Find scenario-specific semantic clusters, each of which consists of a lexicon and extraction patterns
10 Benefit for User
- Low-cost knowledge-base acquisition for IE systems
11 Extraction Patterns
- where c unifies with the context that is defined by semantic class L
- [Figure: dependency contexts V-subj and V-obj]
- (cf. Sudo et al. 2001)
12 Outline
- Introduction
- Research Proposal
- Problem Setting
- Approach
- Information Extraction
- Evaluation
13 Overview
[Figure: Semantic Clustering]
15 Information Retrieval
- Get the relevant document set R
- Get a list of lexical items and extraction patterns ordered by relevance to the scenario
  - TF/IDF scoring
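The slide names TF/IDF scoring but does not give the exact formula, so the following is only a minimal sketch under assumptions: term frequency is counted in the relevant document set, inverse document frequency is computed over the whole collection, and the relevant documents are drawn from that collection. The function name `tfidf_rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(relevant_docs, all_docs):
    """Rank items by (count in relevant_docs) * log(N / document frequency).

    Each document is a list of items (terms or patterns). This weighting is
    an assumed textbook variant, not necessarily the one used in the talk.
    """
    n_docs = len(all_docs)
    # Document frequency: number of documents in the collection containing the item.
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))
    # Term frequency: total occurrences within the relevant document set.
    tf = Counter()
    for doc in relevant_docs:
        tf.update(doc)
    scores = {
        item: count * math.log(n_docs / df[item])
        for item, count in tf.items()
        if df[item] > 0
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy collection; the first and third documents are "relevant".
all_docs = [
    ["company", "name", "president"],
    ["company", "report", "profit"],
    ["president", "resign", "company"],
    ["weather", "rain"],
    ["election", "vote"],
]
ranked = tfidf_rank([all_docs[0], all_docs[2]], all_docs)
```

Items that are frequent in the relevant set but rare elsewhere rise to the top, which is the ordering the later bootstrapping step consumes.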
16 Example of TF/IDF scoring (Management Succession Business)
- 300 documents retrieved from WSJ (7/94 - 8/94)
- Extracted by MINIPAR (Lin 1998)
17 Overview
[Figure: Semantic Clustering]
18 Bootstrapping
- Find one cluster that consists of Lexicon and Extraction Patterns
- Assumption
  - Patterns provide Lexical Classes.
  - Lexicon provides contextual information.
- cf. Riloff and Jones 1999; Agichtein and Gravano 2000
19 Bootstrapping (Cont.)
- Algorithm (cf. Riloff and Jones 1999)
- Given
  - the ordered list of terms
  - the ordered list of extraction patterns
- Lexicon = {}, Pattern = {}
  0. w ← the most relevant term in the list; add w into Lexicon
  1. p ← the most relevant pattern among those that extract w; add p into Pattern
  2. w ← the most relevant term among those that are extracted by p; add w into Lexicon
  3. Go to 1
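The alternating term/pattern loop above can be sketched as follows. This is an illustrative reading, not the proposal's implementation: the slide does not state a stopping criterion or how already-selected items are handled, so the sketch assumes that each step skips items already chosen and that the loop stops when no new pattern or term is available (or after `max_rounds`). The names `bootstrap`, `extracts`, and the toy data are hypothetical.

```python
def bootstrap(terms, patterns, extracts, max_rounds=10):
    """Grow one cluster of terms and patterns by alternating selection.

    terms, patterns: lists ordered by relevance, most relevant first.
    extracts: dict mapping each pattern to the set of terms it extracts.
    Skipping already-selected items and the stopping rule are assumptions.
    """
    lexicon, pattern_set = [], []
    if not terms:
        return lexicon, pattern_set
    # Seed: the most relevant term in the list.
    w = terms[0]
    lexicon.append(w)
    for _ in range(max_rounds):
        # Most relevant unused pattern among those that extract w.
        p = next(
            (q for q in patterns
             if q not in pattern_set and w in extracts.get(q, ())),
            None,
        )
        if p is None:
            break
        pattern_set.append(p)
        # Most relevant new term among those extracted by p.
        w_next = next(
            (t for t in terms if t not in lexicon and t in extracts[p]),
            None,
        )
        if w_next is None:
            break
        w = w_next
        lexicon.append(w)
    return lexicon, pattern_set


# Hypothetical ranked lists and extraction relation.
terms = ["president", "chairman", "company", "officer"]
patterns = ["resign as X", "name X", "X report"]
extracts = {
    "resign as X": {"president", "chairman"},
    "name X": {"chairman", "officer"},
    "X report": {"company"},
}
lexicon, pattern_set = bootstrap(terms, patterns, extracts)
```

In this toy run the cluster grows along the chain of shared contexts and never reaches "company", which no selected pattern extracts.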
20 Example of Bootstrapping (Management Succession Business)
- From WSJ (7/94 - 8/94)
- Extracted by MINIPAR (Lin 1998)
22 Problem: Polysemous Lexicon, Pattern
- Lexicon can be ambiguous
  - e.g. Clinton (Person, Organization, Location)
- Extraction patterns can be ambiguous
  - e.g. be killed in <x> (x: Location, Date)
- Needs more study
  - more restriction
  - Probabilistic model?
23 Overview
[Figure: Semantic Clustering: Source, Information Retrieval, Bootstrapping, Query Expansion]
24 Query Expansion
- Generalize terms in a query with a newly discovered cluster
  - cf. Rocchio 1971 (Vector model)
  - Zhai and Lafferty 2001 (Language-modeling)
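As a rough illustration of generalizing a query with a discovered cluster, the sketch below adds cluster terms to a weighted query vector. It is a deliberately simplified, positive-feedback-only variant in the spirit of Rocchio-style reweighting, not the full Rocchio formula and not necessarily the proposal's method; the function name `expand_query` and the weights `alpha` and `beta` are assumptions.

```python
from collections import Counter


def expand_query(query_terms, cluster_terms, alpha=1.0, beta=0.5):
    """Return a term-weight dict for the expanded query.

    Original query terms receive weight alpha; every term in the newly
    discovered cluster receives an additional weight beta (assumed values).
    """
    weights = Counter()
    for term in query_terms:
        weights[term] += alpha
    for term in cluster_terms:
        weights[term] += beta
    return dict(weights)


# Hypothetical query and cluster for the terrorism scenario.
expanded = expand_query(["bombing"], ["bombing", "kidnapping"])
```

The expanded query can then be fed back into retrieval, so that documents mentioning other members of the cluster are also ranked as relevant.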
25 Overview
[Figure: Semantic Clustering: Source, Information Retrieval, Bootstrapping, Query Expansion]
26 Outline
- Introduction
- Research Proposal
- Problem Setting
- Approach
- Application to Information Extraction
- Discussion
27 Application to Information Extraction
28 Human Intervention
- Extraction patterns
  - Event pattern
    - Context contains a verb or nominalization of verb
    - Used for event extraction and role assignment
    - e.g. (terrorist, fire, x)
  - Local pattern
    - Context contains only enough information to recognize semantic class
    - Used for entity recognition only
    - e.g. (x, Inc.)
- Association of Event Pattern to Role
  - e.g. (company, hire, x) → PersonIn and (company, fire, x) → PersonOut
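The association of event patterns to roles can be pictured as a lookup table that a person fills in. The following is a minimal sketch assuming patterns are stored as (subject class, verb, slot) triples; the table name `ROLE_OF_PATTERN`, the function `assign_role`, and the filler "John Smith" are hypothetical and used only for illustration.

```python
# Human-specified association of event patterns to template roles
# (the two entries come from the slide's example).
ROLE_OF_PATTERN = {
    ("company", "hire", "x"): "PersonIn",
    ("company", "fire", "x"): "PersonOut",
}


def assign_role(subject_class, verb, filler, role_of_pattern=ROLE_OF_PATTERN):
    """Return (role, filler) if an event pattern matches, otherwise None.

    subject_class is assumed to be the semantic class already recognized
    for the subject (for example by a local pattern).
    """
    role = role_of_pattern.get((subject_class, verb, "x"))
    if role is None:
        return None
    return (role, filler)


hired = assign_role("company", "hire", "John Smith")
unknown = assign_role("company", "acquire", "John Smith")
```

Local patterns such as (x, Inc.) need no table entry, since they only recognize a semantic class and do not assign a role.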
29 Outline
- Introduction
- Research Proposal
- Problem Setting
- Approach
- Application to Information Extraction
- Discussion
30 Discussion
- Domain Portability
- User only needs to specify the scenario
- Language Portability
- Language-dependent Tools
- Segmentation (Lemmatization)
- Dependency Parsing
31 Evaluation
- MUC-style (Scenario-Template task)
  - Slot-based
  - Precision, Recall, F-measure
- Domain Portability
  - Several pre-defined tasks that differ in difficulty
- Language Portability
  - Japanese
  - English
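For reference, slot-based precision, recall, and F-measure can be computed as in the sketch below. It assumes exact matching of (slot, filler) pairs and the balanced F-measure; the official MUC scorer is more elaborate (for example, it handles partial matches), so this is only an approximation. The function name `slot_scores` and the sample slot fills are hypothetical.

```python
def slot_scores(system_slots, key_slots):
    """Return (precision, recall, f_measure) over exact (slot, filler) matches."""
    system, key = set(system_slots), set(key_slots)
    correct = len(system & key)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical system output and answer key for one document.
system_out = {("PersonIn", "A"), ("PersonOut", "B"), ("PersonIn", "C")}
answer_key = {("PersonIn", "A"), ("PersonOut", "B"), ("Post", "president"), ("Org", "X")}
precision, recall, f_measure = slot_scores(system_out, answer_key)
```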
32 Contribution
- Tool for Domain Analysis
- Low-cost Knowledge-base Acquisition
- Towards Open-domain Information Extraction
33 Conclusion
- Proposed New Approach for Knowledge-base Acquisition (Semantic Clustering)
- Discussed Application of Acquired KB to Information Extraction (Human Intervention and Local vs. Event patterns)
- Discussed Evaluation with several predefined MUC-style tasks that differ in difficulty and across languages (Domain portability and Language portability)
34 ToDo
- Implementation
- Preparation for Evaluation
- Evaluation
35 Time for Questions (Conclusion)