Title: TM and NLP for Biology: Research Issues in HPSG Parsing

Slide 1: TM and NLP for Biology: Research Issues in HPSG Parsing
Department of Computer Science, School of Information Science and Technology, University of Tokyo, Japan
School of Computer Science, National Centre for Text Mining, University of Manchester, UK
Slide 2: G-protein coupled receptor [D.L. Banville 2006]
[Chart: increase in Medline papers on GPCRs, 1964-2005. Before 1988: 9 papers; 1992: 256 papers; 2005: 14,000 papers, roughly 500 times more.]
Slide 3: NaCTeM (www.nactem.ac.uk)
- First such centre in the world
- Funding: JISC, BBSRC, EPSRC
- Consortium investment
- Chair in TM (Prof. J. Tsujii, Univ. of Tokyo)
- Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk, funded by the Wellcome Trust
- Initial focus: the biomedical academic community
- Extend services to industry
- Extend focus to other domains (e.g. social sciences)
Slide 4: Consortium
- Universities of Manchester and Liverpool
- Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
- Self-funded partners:
  - San Diego Supercomputing Center
  - University of California, Berkeley
  - University of Geneva
  - University of Tokyo
- Strong industrial and academic support:
  - IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, ...
Slides 5-10: (no transcript)
Slide 11: NLP and TM: Linking Text with Knowledge
- Text mining: text as a bag of words; words as surface strings
- Natural language processing: language as a complex system linking surface strings of characters with their meanings; text and words as structured objects
- The combination: NLP-based TM
Slide 12: From Surface Diversities and Ambiguities to Conceptual Invariants
- Non-trivial mappings: terminology, parsing, paraphrasing
- Language domain: linguistic expressions
- Knowledge domain: concepts and the relationships among them, motivated independently of language
Slide 13: Example
Slide 14: "A protein activates B" (Pathway Extraction)
The same relation appears with different surface structures:
- "Since ..., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene."
- "Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription."
- "Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to ..."

Retrieval using region algebra:
  sentence > (arg1_activate > protein)

This is a non-trivial mapping: the same relation in the knowledge domain, which is motivated independently of language, is realized by different structures in the language domain.
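The region-algebra query above can be illustrated with a small sketch. This is not the actual MEDIE implementation; it assumes a toy annotation format in which each sentence comes with predicate-argument tuples produced by a deep parser.

```python
# Toy sketch of region-algebra-style retrieval over predicate-argument
# annotations (NOT the actual MEDIE implementation). Each sentence is a
# list of PAS tuples (predicate, role, filler, filler_type); the query
# below corresponds to: sentence > (arg1_activate > protein)

def matches(sentence_pas):
    """True if some 'activate' predicate has a protein as its arg1."""
    return any(
        pred == "activate" and role == "arg1" and ftype == "protein"
        for pred, role, filler, ftype in sentence_pas
    )

# Hypothetical parser output for the PHO2/PHO5 example sentence:
pas = [
    ("activate", "arg1", "phosphorylated PHO2 protein", "protein"),
    ("activate", "arg2", "transcription of PHO5 gene", "process"),
]
print(matches(pas))  # True
```

Because the query is stated over normalized predicate-argument structures rather than surface strings, all three example sentences above would match it despite their different syntax.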
Slide 15: Predicate-Argument Structure: Parser Based on Probabilistic HPSG (Enju)
[Parse tree for "p53 has been shown to directly activate the Bcl-2 protein": S/VP/NP/ADVP phrase structure with predicate-argument links arg1, arg2, arg3 overlaid.]
Slide 16: MEDIE: Semantic Retrieval System Using Deep Syntax
[Parse tree for "The protein is activated by it" (tags DT NN VBZ VBN IN PRP), with predicate-argument links arg1, arg2, and mod overlaid on the np/vp/pp structure.]
Slides 17-25: (no transcript)
Slide 26: Demos
Slide 27: Predicate-Argument Structure: Parser Based on Probabilistic HPSG (Enju) (repeat of slide 15)
[Parse tree for "p53 has been shown to directly activate the Bcl-2 protein" with arg1, arg2, arg3 links.]
Slides 28-30: (no transcript)
Slide 31: Performance of the Semantic Parser

  Metric                    Penn Treebank   GENIA
  Coverage (%)              99.7            99.2
  F-value (PA relations)    87.4            86.4
  Sentence precision (%)    39.2            31.8
  Processing time           0.68 s/snt      1.00 s/snt
Slide 32: Scalability of TM Tools
Target corpus: MEDLINE
  Papers:             14,792,890
  Abstracts:           7,434,879
  Sentences:          70,815,480
  Words:           1,418,949,650
  Compressed size:    3.2 GB
  Uncompressed size:  10 GB
Suppose, for example, that parsing one sentence takes one second. The whole corpus then takes about 70 million seconds, that is, about 2.2 years.
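The back-of-the-envelope estimate on this slide is easy to verify:

```python
# Verify the slide's estimate: at one second per sentence, how long
# would parsing the whole MEDLINE corpus take?
sentences = 70_815_480
seconds = sentences * 1.0            # assumed rate: 1 s per sentence
years = seconds / (365 * 24 * 3600)  # seconds in a (non-leap) year
print(f"{seconds:.0f} s = {years:.2f} years")  # about 2.25 years
```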
Slide 33: TM and GRID
- Solution:
  - The entire MEDLINE was parsed by distributed PC clusters totalling 340 CPUs
  - Parallel processing was managed by the grid platform GXP [Taura 2004]
- Experiment:
  - The entire MEDLINE was parsed in 8 days
- Output:
  - Syntactic parse trees and predicate-argument structures in XML format
  - Compressed/uncompressed output sizes: 42.5 GB / 260 GB
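Parsing is embarrassingly parallel at the sentence level, which is what makes the grid approach work. The slides use the grid shell GXP over 340 CPUs; as a minimal single-machine sketch of the same idea, sentences can be distributed over worker processes (parse_sentence here is a stand-in, not the real Enju invocation):

```python
# Single-machine sketch of sentence-level parallel parsing.
# parse_sentence is a stand-in for invoking the real HPSG parser.
from multiprocessing import Pool

def parse_sentence(sentence):
    # Stand-in for the real parser: return a trivial "parse".
    return {"sentence": sentence, "tokens": sentence.split()}

if __name__ == "__main__":
    corpus = ["I like it", "The protein is activated by it"]
    with Pool(processes=4) as pool:
        parses = pool.map(parse_sentence, corpus)
    print(len(parses))  # 2
```

On a grid, the same map step is simply spread across cluster nodes instead of local processes; since sentences are independent, throughput scales almost linearly with the number of CPUs.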
Slide 34: Efficient Parsing for HPSG
Slide 35: Background: HPSG
- Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
- A lexicalized, constraint-based grammar:
  - A few rule schemata: general constraints on linguistic constructions
  - Constraints embedded in the lexicon: word-specific constraints
  - Constraints between phrase structures and semantic structures
Slide 36: Parsing by HPSG
[Figure: the sentence "I like it" to be parsed.]
Slide 37: Parsing by HPSG: Assignment of Lexical Entries
Slide 38: Application of Rule Schemata: Head-Complement
Slide 39: Application of Rule Schemata: Subject-Head
[Figures: feature structures combined step by step; the numbers 1 and 2 are structure-sharing tags.]
Slide 40: Inefficiency of HPSG Parsing
- Complex DAGs: typed feature structures
  - Abstract machine for unification (LiLFeS)
- Unification is an expensive operation
  - Remedy: CFG approximation (CFG filtering)
- Assignment of lexical entries
  - Large reduction of the search space via supertagging
Slide 41: Filtering with CFG (1/5)
- Two-phase parsing:
  - Approximate the HPSG with a CFG while keeping the important constraints.
  - The obtained CFG may over-generate, but can be used for filtering.
  - Rewriting in the CFG is far less expensive than applying rule schemata, principles, and so on.
- Pipeline: the HPSG (feature structures) is compiled into a CFG; input sentences go through the built-in CFG parser, then LiLFeS unification, producing complete parse trees as output.
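The two-phase idea can be sketched in miniature. This is a toy grammar, not the compiled Enju grammar: phase 1 runs a cheap CKY parse with the CFG approximation, and the expensive typed-feature-structure unification (a stub here) is applied only to edges the CFG licenses.

```python
# Phase 1: CKY over a toy CFG approximation of the grammar.
UNARY = {"I": {"NP"}, "it": {"NP"}, "like": {"V"}}
BINARY = {("NP", "VP"): "S", ("V", "NP"): "VP"}

def cky(words):
    """Fill a CKY chart; chart[i][j] holds categories spanning words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(UNARY.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a in chart[i][k]:
                    for b in chart[k][j]:
                        if (a, b) in BINARY:
                            chart[i][j].add(BINARY[(a, b)])
    return chart

def unify(cat, i, j):
    # Phase 2 stub: stands in for the expensive typed-feature-structure
    # unification run in LiLFeS, called only on CFG-licensed edges.
    return True

words = "I like it".split()
chart = cky(words)
licensed = "S" in chart[0][len(words)] and unify("S", 0, len(words))
print(licensed)  # True
```

The point of the design is that the O(n^3) CKY pass over atomic category symbols is cheap, so the over-generating CFG discards most hopeless edges before any unification is attempted.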
Slide 42: Inefficiency of HPSG Parsing (repeat of slide 40)
Slide 43: HPSG and Parsing
- Most of the constraints are in the lexical entries (LEs).
- Once LEs are assigned, the parsing results are largely determined implicitly.
- Example LE for "like": HEAD verb, SUBJ <NP>, COMPS <NP>
Slide 44: Supertagging and Efficient Parsing [Clark and Curran, 2004; Ninomiya et al., 2006]
- Supertagging: model P(sequence of LEs | sequence of words)
- Selection of lexical-entry assignments [Bangalore and Joshi, 1999]
[Figure: for each word of "I like it", candidate lexical entries such as (HEAD noun, SUBJ <>, COMPS <>) and (HEAD verb, SUBJ <NP>, COMPS <NP>), ranked by probability; only entries above a threshold are kept.]
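The thresholding step in the figure can be sketched as follows; the probabilities and threshold value here are illustrative, not taken from the actual supertagger.

```python
# Sketch of the supertagging pruning step: each word has a distribution
# over candidate lexical entries, and only entries above a probability
# threshold are passed to the parser. Probabilities are hypothetical.
def supertag(candidates, threshold=0.1):
    """Keep only lexical entries whose probability exceeds the threshold."""
    return {
        word: [le for le, p in entries if p > threshold]
        for word, entries in candidates.items()
    }

candidates = {
    "I":    [("HEAD noun, SUBJ <>, COMPS <>", 0.90), ("HEAD verb, SUBJ <NP>, COMPS <NP>", 0.02)],
    "like": [("HEAD verb, SUBJ <NP>, COMPS <NP>", 0.85), ("HEAD prep, SUBJ <>, COMPS <NP>", 0.08)],
    "it":   [("HEAD noun, SUBJ <>, COMPS <>", 0.95)],
}
pruned = supertag(candidates)
print(sum(len(v) for v in pruned.values()))  # 3 entries survive
```

In practice the threshold trades speed against coverage: a higher threshold leaves fewer entries in the chart but risks pruning the entry needed for a complete parse.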
Slide 45: Chart Parsing
[Figure: chart parsing over the lexical entries assigned to "I like it".]
Slide 46: Efficient Parser
- Use a smaller number of LE assignments
- Prefer LE assignments that lead to complete parse trees
- Previous methods:
  1. Chart parsing using the initial LE assignment
  2. Extending the LE assignment when parsing fails
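The fail-and-widen strategy described above can be sketched with a stand-in parser: start with the most restrictive supertag assignment and lower the threshold only when parsing fails. The threshold values and the success condition here are purely illustrative.

```python
# Sketch of the parse-then-widen strategy (stand-in parser, assumed
# thresholds): try the tightest lexical-entry assignment first, and
# widen it (lower the supertag threshold) only on failure.
def parse_with_assignment(sentence, threshold):
    # Stand-in: pretend parsing succeeds only once enough lexical
    # entries survive pruning, i.e. when the threshold is low enough.
    return "parse tree" if threshold <= 0.2 else None

def staged_parse(sentence, thresholds=(0.5, 0.2, 0.05)):
    for t in thresholds:
        tree = parse_with_assignment(sentence, t)
        if tree is not None:
            return tree, t
    return None, None

tree, used = staged_parse("I like it")
print(used)  # 0.2: the first widening step succeeded
```

Most sentences succeed at the tightest setting, so the average cost stays close to the cheap case while the fallback preserves coverage.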
Slide 47: System Overview
Input sentence ("I like it"), then: CFG filtering, supertagger, deterministic shift/reduce parser.
[Figure: candidate lexical entries for each word; the high-probability entries are selected and passed on.]
Slide 48: Experimental Results

  Model                                             LP (%)  LR (%)  F1 (%)  Avg. time
  Staged/deterministic model                        86.93   86.47   86.70    30 ms/snt
  Previous method 1 (supertagger + chart parser)    87.35   86.29   86.81   183 ms/snt
  Previous method 2 (unigram + chart parser)        84.96   84.25   84.60   674 ms/snt

The staged/deterministic model is about 6 times faster than previous method 1 and about 20 times faster than the initial model.
Slide 49: Domain/Text-Type Adaptation
Slide 50: Adaptation Results

  Model                                   F-score  Training time (s)
  Baseline (PTB-trained, PTB-applied)     89.81          0
  Baseline (PTB-trained, GENIA-applied)   86.39          0
  Retraining (GENIA)                      88.45     14,695
  Retraining (PTB + GENIA)                89.94    238,576
  Structure with RefDist                  88.18     21,833
  Lexical with RefDist                    89.04     12,957
  Lex/Structure with RefDist              90.15     31,637
Slide 51: Adaptation with a Reference Distribution
[Figure: model equation with its parts labeled: lexical assignment, syntactic preference, feature functions, feature weights, and the original model.]
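The labeled equation in the figure is the standard log-linear formulation with a reference distribution; a sketch of that formulation, with illustrative symbols (the slide itself does not give them):

```latex
% Log-linear model with the original (PTB-trained) model p_0 used as a
% reference distribution; symbols are illustrative.
p(t \mid s) \;=\;
  \frac{p_0(t \mid s)\,\exp\!\Big(\sum_i \lambda_i f_i(t, s)\Big)}
       {\sum_{t'} p_0(t' \mid s)\,\exp\!\Big(\sum_i \lambda_i f_i(t', s)\Big)}
```

Here s is the input sentence, t a candidate parse, f_i the feature functions with weights lambda_i, and p_0 the original model; only the weights are re-estimated on the in-domain (GENIA) data, which is why the RefDist rows in the tables train far faster than full retraining on PTB + GENIA.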
Slide 52: [Learning curves: F-score (83-90) versus the number of sentences in the GENIA training set (0-8,000), for Baseline (PTB), Simple retraining (GENIA), Retraining (GENIA + PTB), Structure with RefDist, Lexical with RefDist, and Lexical/Structure with RefDist.]
Slide 53: [F-score (83-90) versus training time (0-30,000 s), for Structure with RefDist, Lexicon with RefDist, and Lex/Str with RefDist.]
Slide 54: Adaptation results (repeat of the table on slide 50)
Slide 55: Tool 1: POS Tagger
Example: "The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes"
  Tags: DT NN NN NN VBZ JJ NN NN NN CD NN NN IN NNS
- General-purpose POS taggers are trained on the WSJ: Brill's tagger, TnT tagger, MXPOST, etc. (about 97% accuracy)
- General-purpose POS taggers do not work well for MEDLINE abstracts
Slide 56: Errors Made by the TnT Tagger [Brants 2000]
[Examples of MEDLINE phrases mis-tagged by TnT, with their tag sequences: "a chromosomal translocation in ...", "membrane potential after mitogen binding", "two factors, which bind to the same kappa B enhancers", "by analysing the Ag amino acid sequence", "to contain more T-cell determinants than", "stimulation of interferon beta gene transcription in vitro by".]
Slide 57: Performance of the GENIA Tagger

GENIA tagger accuracy (%), by training corpus (rows) and test corpus (columns):
  Training corpus   WSJ    GENIA
  WSJ               97.0   84.3
  GENIA             75.2   98.1
  WSJ + GENIA       96.9   98.1

(Reference) TnT tagger accuracy (%):
  Training corpus   WSJ    GENIA
  WSJ               96.7   84.3
  GENIA             80.1   97.9
  WSJ + GENIA       96.5   97.5

For TnT, degradations of 0.2-0.4 points were observed with the mixed corpus, compared with the taggers trained on the pure corpora; the tagger trained on the mixed corpus shows no such degradation.
Slide 58: CRF-Based POS Tagging with Active Learning (GENIA)
- 3,000 sentences: 98.4%; 20,000 sentences: 98.58%
Slide 59: CRF-Based POS Tagging with Active Learning (PTB)
- 10,000 sentences: 96.76%; best performance: 97.18%
Slide 60: Applications
Slide 61: Our Policy for Information Extraction
- Separate the domain/task-independent part of an IE system from the domain/task-specific part.
Slide 62: Our Policy (continued)
- Separate the domain/task-independent part from the domain/task-specific part.
- Task-independent: a full parser normalizes sentences into predicate-argument structures (PASs).
- Task-specific: extraction rules over PASs.
Slide 63: Our Policy (continued)
- Distinguish the domain-independent part from the domain-specific part.
- Task-independent: a full parser normalizes sentences into PASs.
- Task-specific: extraction rules over PASs, learned automatically from a corpus.
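The division of labor above can be sketched with a toy PAS format (not the real rule language, whose rules are learned from a corpus): the parser's task-independent output is a set of normalized predicate-argument records, and a task-specific rule simply pattern-matches on them.

```python
# Sketch of a task-specific extraction rule over task-independent PASs
# (toy format; the real rules are learned automatically from a corpus).
def extract_activation(pas_list):
    """Apply the rule 'X activates Y': emit (cause, predicate, theme)."""
    events = []
    for pas in pas_list:
        if pas["pred"] == "activate":
            events.append((pas["arg1"], "activate", pas["arg2"]))
    return events

# Hypothetical normalized output for
# "p53 has been shown to directly activate the Bcl-2 protein":
pas_list = [{"pred": "activate", "arg1": "p53", "arg2": "Bcl-2 protein"}]
print(extract_activation(pas_list))  # [('p53', 'activate', 'Bcl-2 protein')]
```

Because the parser has already normalized passives, control verbs, and other surface variation into the same PAS, one rule covers many surface realizations of the same event.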
Slide 64: GENIA Event Annotation: Example
Annotation labels: ClueType, LinkTheme, LinkCause, ClueLoc, ClueTime.
For an identified event in a given sentence:
- classify the type of the event and record the text span giving its clue (ClueType);
- identify the theme of the event and record the text span linking the theme to the event (LinkTheme);
- identify the cause of the event and record the text span linking the cause to the event (LinkCause);
- record the environment (location, time) of the event (ClueLoc, ClueTime).
Slide 65: Gene_expression
- Theme patterns observed (2,958 total):
  - Protein: 2,308
  - DNA: 591
  - RNA: 25
  - Peptide: 4
  - Protein + Protein: 2
  - Erroneous: 27
- Keywords: coexpress, nonexpress, overexpress, express, biosynthesis, product, synthesize, constitute, coexpression
Slide 66: Transcription
- Theme patterns observed (929 total):
  - DNA: 449
  - RNA: 272
  - Protein: 167
  - Peptide: 2
  - Erroneous: 22
- Keywords: transcrib-, transcript, synthesi-, express, ...
Slide 67: Localization
- ClueLoc values observed:
  - NONE: 241; nuclear: 140; to the nucleus: 12; into the nucleus: 11; cytoplasmic: 8; in the cytoplasm: 7; macrophages: 5; nuclear in T lymphocytes: 4; monocytes: 4; in the nucleus: 4; in the cytosol: 4; in colostrum: 4; from the cytoplasm to the nucleus: 4
- Theme patterns observed (730 total):
  - Protein: 608; Lipid: 31; Atom: 29; Other_organic_compound: 14; DNA: 12; Virus: 5; Carbohydrate: 5; RNA: 4; Inorganic: 4; Peptide: 3
- Keywords: translocation, secretion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, migrate, localisation, move, delivery, export, ...
Slide 68: Localization (continued)
- Keywords and locations:
  - translocation (166): nuclear 108, NONE 38, ...
  - secretion (100): NONE 57, name_of_cells 43
  - release (80): NONE 51, name_of_cells 19, ...
  - localization (30): nuclear 25, intracellular 3
  - uptake (24): NONE 14, name_of_cells 20
- Keywords and themes:
  - translocation (166): Protein 161, Virus 4, RNA 1
  - secretion (100): Protein 98, Lipid 1, Peptide 1
  - release (80): Protein 67, Other_organic_compound 6, Lipid 3
  - localization (30): Protein 30
  - uptake (24): Lipid 15, Carbohydrate 5, Protein 4
Slide 69: Future Plan
Slides 70-71: (no transcript)
Slide 72: Future Directions
- Domain adaptation and inter-operability
  - High performance can be obtained by exploiting domain-specific characteristics and domain semantics
  - Differences among abstracts, full papers, and comments in DBs
  - Standardized interfaces (APIs) for NLP tools
- Text archives
  - Abstracts, full papers, and comment/summary descriptions in DBs
- Combining NLP tools with mining tools
  - Knowledge discovery (disease-gene association)
  - Hypothesis generation
  - Automatic data interpretation
Slide 73: Future Directions (repeat of slide 72)