Title: TM and NLP for Biology: Research Issues in HPSG Parsing

Slide 1: TM and NLP for Biology: Research Issues in HPSG Parsing
Department of Computer Science, School of Information Science and Technology, University of Tokyo, Japan
School of Computer Science, National Centre for Text Mining, University of Manchester, UK
Slide 2: G-protein coupled receptor [D.L. Banville 2006]
[Chart: increase in Medline papers on GPCRs, 1964-2005. Before 1988: 9 papers; 1992: 256 papers; 2005: 14,000 papers, roughly 500 times more.]
Slide 3: NaCTeM (www.nactem.ac.uk)
- First such centre in the world
- Funding: JISC, BBSRC, EPSRC
- Consortium investment
- Chair in TM (Prof. J. Tsujii, Univ. of Tokyo)
- Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk, funded by the Wellcome Trust
- Initial focus: the biomedical academic community
- Extend services to industry
- Extend focus to other domains (e.g. social sciences)
Slide 4: Consortium
- Universities of Manchester and Liverpool
- Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
- Self-funded partners:
  - San Diego Supercomputing Center
  - University of California, Berkeley
  - University of Geneva
  - University of Tokyo
- Strong industrial and academic support:
  - IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, ...
Slides 5-10: (no transcript)
Slide 11: NLP and TM: Linking Text with Knowledge
- Text mining: text as a bag of words; words as surface strings
- Natural language processing: language as a complex system linking surface strings of characters with their meanings; text and words as structured objects
- The combination: NLP-based TM
Slide 12: From Surface Diversities and Ambiguities to Conceptual Invariants
- Non-trivial mappings: terminology, parsing, paraphrasing
- Language domain: linguistic expressions
- Knowledge domain: concepts and the relationships among them, motivated independently of language
Slide 13: Example
Slide 14: "A protein activates B" (Pathway Extraction)
The same relation appears with different surface structures:
- "Since ..., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene."
- "Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription."
- "Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to ..."

Retrieval using region algebra:
  sentence > (arg1_activate > protein)

This is a non-trivial mapping: the same relation in the knowledge domain, which is motivated independently of language, is realized by different structures in the language domain.
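The region-algebra query above can be illustrated with a small sketch. This is not the actual MEDIE implementation; it assumes a toy annotation format in which each sentence comes with predicate-argument tuples produced by a deep parser.

```python
# Toy sketch of region-algebra-style retrieval over predicate-argument
# annotations (NOT the actual MEDIE implementation). Each sentence is a
# list of PAS tuples (predicate, role, filler, filler_type); the query
# below corresponds to: sentence > (arg1_activate > protein)

def matches(sentence_pas):
    """True if some 'activate' predicate has a protein as its arg1."""
    return any(
        pred == "activate" and role == "arg1" and ftype == "protein"
        for pred, role, filler, ftype in sentence_pas
    )

# Hypothetical parser output for the PHO2/PHO5 example sentence:
pas = [
    ("activate", "arg1", "phosphorylated PHO2 protein", "protein"),
    ("activate", "arg2", "transcription of PHO5 gene", "process"),
]
print(matches(pas))  # True
```

Because the query is stated over normalized predicate-argument structures rather than surface strings, all three example sentences above would match it despite their different syntax.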
Slide 15: Predicate-Argument Structure: Parser Based on Probabilistic HPSG (Enju)
[Parse tree for "p53 has been shown to directly activate the Bcl-2 protein": S/VP/NP/ADVP phrase structure with predicate-argument links arg1, arg2, arg3 overlaid.]
Slide 16: MEDIE: Semantic Retrieval System Using Deep Syntax
[Parse tree for "The protein is activated by it" (tags DT NN VBZ VBN IN PRP), with predicate-argument links arg1, arg2, and mod overlaid on the np/vp/pp structure.]
Slides 17-25: (no transcript)
Slide 26: Demos
Slide 27: Predicate-Argument Structure: Parser Based on Probabilistic HPSG (Enju) (repeat of slide 15)
[Parse tree for "p53 has been shown to directly activate the Bcl-2 protein" with arg1, arg2, arg3 links.]
Slides 28-30: (no transcript)
Slide 31: Performance of the Semantic Parser

  Metric                    Penn Treebank   GENIA
  Coverage (%)              99.7            99.2
  F-value (PA relations)    87.4            86.4
  Sentence precision (%)    39.2            31.8
  Processing time           0.68 s/snt      1.00 s/snt
Slide 32: Scalability of TM Tools
Target corpus: MEDLINE
  Papers:             14,792,890
  Abstracts:           7,434,879
  Sentences:          70,815,480
  Words:           1,418,949,650
  Compressed size:    3.2 GB
  Uncompressed size:  10 GB
Suppose, for example, that parsing one sentence takes one second. The whole corpus then takes about 70 million seconds, that is, about 2.2 years.
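The back-of-the-envelope estimate on this slide is easy to verify:

```python
# Verify the slide's estimate: at one second per sentence, how long
# would parsing the whole MEDLINE corpus take?
sentences = 70_815_480
seconds = sentences * 1.0            # assumed rate: 1 s per sentence
years = seconds / (365 * 24 * 3600)  # seconds in a (non-leap) year
print(f"{seconds:.0f} s = {years:.2f} years")  # about 2.25 years
```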
Slide 33: TM and GRID
- Solution:
  - The entire MEDLINE was parsed by distributed PC clusters totalling 340 CPUs
  - Parallel processing was managed by the grid platform GXP [Taura 2004]
- Experiment:
  - The entire MEDLINE was parsed in 8 days
- Output:
  - Syntactic parse trees and predicate-argument structures in XML format
  - Compressed/uncompressed output sizes: 42.5 GB / 260 GB
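Parsing is embarrassingly parallel at the sentence level, which is what makes the grid approach work. The slides use the grid shell GXP over 340 CPUs; as a minimal single-machine sketch of the same idea, sentences can be distributed over worker processes (parse_sentence here is a stand-in, not the real Enju invocation):

```python
# Single-machine sketch of sentence-level parallel parsing.
# parse_sentence is a stand-in for invoking the real HPSG parser.
from multiprocessing import Pool

def parse_sentence(sentence):
    # Stand-in for the real parser: return a trivial "parse".
    return {"sentence": sentence, "tokens": sentence.split()}

if __name__ == "__main__":
    corpus = ["I like it", "The protein is activated by it"]
    with Pool(processes=4) as pool:
        parses = pool.map(parse_sentence, corpus)
    print(len(parses))  # 2
```

On a grid, the same map step is simply spread across cluster nodes instead of local processes; since sentences are independent, throughput scales almost linearly with the number of CPUs.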
Slide 34: Efficient Parsing for HPSG
Slide 35: Background: HPSG
- Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
- A lexicalized, constraint-based grammar:
  - A few rule schemata: general constraints on linguistic constructions
  - Constraints embedded in the lexicon: word-specific constraints
  - Constraints between phrase structures and semantic structures
Slide 36: Parsing by HPSG
[Figure: the sentence "I like it" to be parsed.]
Slide 37: Parsing by HPSG: Assignment of Lexical Entries
Slide 38: Application of Rule Schemata: Head-Complement
Slide 39: Application of Rule Schemata: Subject-Head
[Figures: feature structures combined step by step; the numbers 1 and 2 are structure-sharing tags.]
Slide 40: Inefficiency of HPSG Parsing
- Complex DAGs: typed feature structures
  - Abstract machine for unification (LiLFeS)
- Unification is an expensive operation
  - Remedy: CFG approximation (CFG filtering)
- Assignment of lexical entries
  - Large reduction of the search space via supertagging
Slide 41: Filtering with CFG (1/5)
- Two-phase parsing:
  - Approximate the HPSG with a CFG while keeping the important constraints.
  - The obtained CFG may over-generate, but can be used for filtering.
  - Rewriting in the CFG is far less expensive than applying rule schemata, principles, and so on.
- Pipeline: the HPSG (feature structures) is compiled into a CFG; input sentences go through the built-in CFG parser, then LiLFeS unification, producing complete parse trees as output.
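The two-phase idea can be sketched in miniature. This is a toy grammar, not the compiled Enju grammar: phase 1 runs a cheap CKY parse with the CFG approximation, and the expensive typed-feature-structure unification (a stub here) is applied only to edges the CFG licenses.

```python
# Phase 1: CKY over a toy CFG approximation of the grammar.
UNARY = {"I": {"NP"}, "it": {"NP"}, "like": {"V"}}
BINARY = {("NP", "VP"): "S", ("V", "NP"): "VP"}

def cky(words):
    """Fill a CKY chart; chart[i][j] holds categories spanning words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(UNARY.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a in chart[i][k]:
                    for b in chart[k][j]:
                        if (a, b) in BINARY:
                            chart[i][j].add(BINARY[(a, b)])
    return chart

def unify(cat, i, j):
    # Phase 2 stub: stands in for the expensive typed-feature-structure
    # unification run in LiLFeS, called only on CFG-licensed edges.
    return True

words = "I like it".split()
chart = cky(words)
licensed = "S" in chart[0][len(words)] and unify("S", 0, len(words))
print(licensed)  # True
```

The point of the design is that the O(n^3) CKY pass over atomic category symbols is cheap, so the over-generating CFG discards most hopeless edges before any unification is attempted.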
Slide 42: Inefficiency of HPSG Parsing (repeat of slide 40)
Slide 43: HPSG and Parsing
- Most of the constraints are in the lexical entries (LEs).
- Once LEs are assigned, the parsing results are largely determined implicitly.
- Example LE for "like": HEAD verb, SUBJ <NP>, COMPS <NP>
Slide 44: Supertagging and Efficient Parsing [Clark and Curran, 2004; Ninomiya et al., 2006]
- Supertagging: model P(sequence of LEs | sequence of words)
- Selection of lexical-entry assignments [Bangalore and Joshi, 1999]
[Figure: for each word of "I like it", candidate lexical entries such as (HEAD noun, SUBJ <>, COMPS <>) and (HEAD verb, SUBJ <NP>, COMPS <NP>), ranked by probability; only entries above a threshold are kept.]
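The thresholding step in the figure can be sketched as follows; the probabilities and threshold value here are illustrative, not taken from the actual supertagger.

```python
# Sketch of the supertagging pruning step: each word has a distribution
# over candidate lexical entries, and only entries above a probability
# threshold are passed to the parser. Probabilities are hypothetical.
def supertag(candidates, threshold=0.1):
    """Keep only lexical entries whose probability exceeds the threshold."""
    return {
        word: [le for le, p in entries if p > threshold]
        for word, entries in candidates.items()
    }

candidates = {
    "I":    [("HEAD noun, SUBJ <>, COMPS <>", 0.90), ("HEAD verb, SUBJ <NP>, COMPS <NP>", 0.02)],
    "like": [("HEAD verb, SUBJ <NP>, COMPS <NP>", 0.85), ("HEAD prep, SUBJ <>, COMPS <NP>", 0.08)],
    "it":   [("HEAD noun, SUBJ <>, COMPS <>", 0.95)],
}
pruned = supertag(candidates)
print(sum(len(v) for v in pruned.values()))  # 3 entries survive
```

In practice the threshold trades speed against coverage: a higher threshold leaves fewer entries in the chart but risks pruning the entry needed for a complete parse.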
Slide 45: Chart Parsing
[Figure: chart parsing over the lexical entries assigned to "I like it".]
Slide 46: Efficient Parser
- Use a smaller number of LE assignments
- Prefer LE assignments that lead to complete parse trees
- Previous methods:
  1. Chart parsing using the initial LE assignment
  2. Extending the LE assignment when parsing fails
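The fail-and-widen strategy described above can be sketched with a stand-in parser: start with the most restrictive supertag assignment and lower the threshold only when parsing fails. The threshold values and the success condition here are purely illustrative.

```python
# Sketch of the parse-then-widen strategy (stand-in parser, assumed
# thresholds): try the tightest lexical-entry assignment first, and
# widen it (lower the supertag threshold) only on failure.
def parse_with_assignment(sentence, threshold):
    # Stand-in: pretend parsing succeeds only once enough lexical
    # entries survive pruning, i.e. when the threshold is low enough.
    return "parse tree" if threshold <= 0.2 else None

def staged_parse(sentence, thresholds=(0.5, 0.2, 0.05)):
    for t in thresholds:
        tree = parse_with_assignment(sentence, t)
        if tree is not None:
            return tree, t
    return None, None

tree, used = staged_parse("I like it")
print(used)  # 0.2: the first widening step succeeded
```

Most sentences succeed at the tightest setting, so the average cost stays close to the cheap case while the fallback preserves coverage.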
Slide 47: System Overview
Input sentence ("I like it"), then: CFG filtering, supertagger, deterministic shift/reduce parser.
[Figure: candidate lexical entries for each word; the high-probability entries are selected and passed on.]
Slide 48: Experimental Results

  Model                                             LP (%)  LR (%)  F1 (%)  Avg. time
  Staged/deterministic model                        86.93   86.47   86.70    30 ms/snt
  Previous method 1 (supertagger + chart parser)    87.35   86.29   86.81   183 ms/snt
  Previous method 2 (unigram + chart parser)        84.96   84.25   84.60   674 ms/snt

The staged/deterministic model is about 6 times faster than previous method 1 and about 20 times faster than the initial model.
Slide 49: Domain/Text-Type Adaptation
Slide 50: Adaptation Results

  Model                                   F-score  Training time (s)
  Baseline (PTB-trained, PTB-applied)     89.81          0
  Baseline (PTB-trained, GENIA-applied)   86.39          0
  Retraining (GENIA)                      88.45     14,695
  Retraining (PTB + GENIA)                89.94    238,576
  Structure with RefDist                  88.18     21,833
  Lexical with RefDist                    89.04     12,957
  Lex/Structure with RefDist              90.15     31,637
Slide 51: Adaptation with a Reference Distribution
[Figure: model equation with its parts labeled: lexical assignment, syntactic preference, feature functions, feature weights, and the original model.]
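The labeled equation in the figure is the standard log-linear formulation with a reference distribution; a sketch of that formulation, with illustrative symbols (the slide itself does not give them):

```latex
% Log-linear model with the original (PTB-trained) model p_0 used as a
% reference distribution; symbols are illustrative.
p(t \mid s) \;=\;
  \frac{p_0(t \mid s)\,\exp\!\Big(\sum_i \lambda_i f_i(t, s)\Big)}
       {\sum_{t'} p_0(t' \mid s)\,\exp\!\Big(\sum_i \lambda_i f_i(t', s)\Big)}
```

Here s is the input sentence, t a candidate parse, f_i the feature functions with weights lambda_i, and p_0 the original model; only the weights are re-estimated on the in-domain (GENIA) data, which is why the RefDist rows in the tables train far faster than full retraining on PTB + GENIA.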
Slide 52: [Learning curves: F-score (83-90) versus the number of sentences in the GENIA training set (0-8,000), for Baseline (PTB), Simple retraining (GENIA), Retraining (GENIA + PTB), Structure with RefDist, Lexical with RefDist, and Lexical/Structure with RefDist.]
Slide 53: [F-score (83-90) versus training time (0-30,000 s), for Structure with RefDist, Lexicon with RefDist, and Lex/Str with RefDist.]
Slide 54: Adaptation results (repeat of the table on slide 50)
Slide 55: Tool 1: POS Tagger
Example: "The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes"
  Tags: DT NN NN NN VBZ JJ NN NN NN CD NN NN IN NNS
- General-purpose POS taggers are trained on the WSJ: Brill's tagger, TnT tagger, MXPOST, etc. (about 97% accuracy)
- General-purpose POS taggers do not work well for MEDLINE abstracts
Slide 56: Errors Made by the TnT Tagger [Brants 2000]
[Examples of MEDLINE phrases mis-tagged by TnT, with their tag sequences: "a chromosomal translocation in ...", "membrane potential after mitogen binding", "two factors, which bind to the same kappa B enhancers", "by analysing the Ag amino acid sequence", "to contain more T-cell determinants than", "stimulation of interferon beta gene transcription in vitro by".]
Slide 57: Performance of the GENIA Tagger

GENIA tagger accuracy (%), by training corpus (rows) and test corpus (columns):
  Training corpus   WSJ    GENIA
  WSJ               97.0   84.3
  GENIA             75.2   98.1
  WSJ + GENIA       96.9   98.1

(Reference) TnT tagger accuracy (%):
  Training corpus   WSJ    GENIA
  WSJ               96.7   84.3
  GENIA             80.1   97.9
  WSJ + GENIA       96.5   97.5

For TnT, degradations of 0.2-0.4 points were observed with the mixed corpus, compared with the taggers trained on the pure corpora; the tagger trained on the mixed corpus shows no such degradation.
Slide 58: CRF-Based POS Tagging with Active Learning (GENIA)
- 3,000 sentences: 98.4%; 20,000 sentences: 98.58%
Slide 59: CRF-Based POS Tagging with Active Learning (PTB)
- 10,000 sentences: 96.76%; best performance: 97.18%
Slide 60: Applications
Slide 61: Our Policy for Information Extraction
- Separate the domain/task-independent part of an IE system from the domain/task-specific part.
Slide 62: Our Policy (continued)
- Separate the domain/task-independent part from the domain/task-specific part.
- Task-independent: a full parser normalizes sentences into predicate-argument structures (PASs).
- Task-specific: extraction rules over PASs.
Slide 63: Our Policy (continued)
- Distinguish the domain-independent part from the domain-specific part.
- Task-independent: a full parser normalizes sentences into PASs.
- Task-specific: extraction rules over PASs, learned automatically from a corpus.
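The division of labor above can be sketched with a toy PAS format (not the real rule language, whose rules are learned from a corpus): the parser's task-independent output is a set of normalized predicate-argument records, and a task-specific rule simply pattern-matches on them.

```python
# Sketch of a task-specific extraction rule over task-independent PASs
# (toy format; the real rules are learned automatically from a corpus).
def extract_activation(pas_list):
    """Apply the rule 'X activates Y': emit (cause, predicate, theme)."""
    events = []
    for pas in pas_list:
        if pas["pred"] == "activate":
            events.append((pas["arg1"], "activate", pas["arg2"]))
    return events

# Hypothetical normalized output for
# "p53 has been shown to directly activate the Bcl-2 protein":
pas_list = [{"pred": "activate", "arg1": "p53", "arg2": "Bcl-2 protein"}]
print(extract_activation(pas_list))  # [('p53', 'activate', 'Bcl-2 protein')]
```

Because the parser has already normalized passives, control verbs, and other surface variation into the same PAS, one rule covers many surface realizations of the same event.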
Slide 64: GENIA Event Annotation: Example
Annotation labels: ClueType, LinkTheme, LinkCause, ClueLoc, ClueTime.
For an identified event in a given sentence:
- classify the type of the event and record the text span giving its clue (ClueType);
- identify the theme of the event and record the text span linking the theme to the event (LinkTheme);
- identify the cause of the event and record the text span linking the cause to the event (LinkCause);
- record the environment (location, time) of the event (ClueLoc, ClueTime).
Slide 65: Gene_expression
- Theme patterns observed (2,958 total):
  - Protein: 2,308
  - DNA: 591
  - RNA: 25
  - Peptide: 4
  - Protein + Protein: 2
  - Erroneous: 27
- Keywords: coexpress, nonexpress, overexpress, express, biosynthesis, product, synthesize, constitute, coexpression
Slide 66: Transcription
- Theme patterns observed (929 total):
  - DNA: 449
  - RNA: 272
  - Protein: 167
  - Peptide: 2
  - Erroneous: 22
- Keywords: transcrib-, transcript, synthesi-, express, ...
Slide 67: Localization
- ClueLoc values observed:
  - NONE: 241; nuclear: 140; to the nucleus: 12; into the nucleus: 11; cytoplasmic: 8; in the cytoplasm: 7; macrophages: 5; nuclear in T lymphocytes: 4; monocytes: 4; in the nucleus: 4; in the cytosol: 4; in colostrum: 4; from the cytoplasm to the nucleus: 4
- Theme patterns observed (730 total):
  - Protein: 608; Lipid: 31; Atom: 29; Other_organic_compound: 14; DNA: 12; Virus: 5; Carbohydrate: 5; RNA: 4; Inorganic: 4; Peptide: 3
- Keywords: translocation, secretion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, migrate, localisation, move, delivery, export, ...
Slide 68: Localization (continued)
- Keywords and locations:
  - translocation (166): nuclear 108, NONE 38, ...
  - secretion (100): NONE 57, name_of_cells 43
  - release (80): NONE 51, name_of_cells 19, ...
  - localization (30): nuclear 25, intracellular 3
  - uptake (24): NONE 14, name_of_cells 20
- Keywords and themes:
  - translocation (166): Protein 161, Virus 4, RNA 1
  - secretion (100): Protein 98, Lipid 1, Peptide 1
  - release (80): Protein 67, Other_organic_compound 6, Lipid 3
  - localization (30): Protein 30
  - uptake (24): Lipid 15, Carbohydrate 5, Protein 4
Slide 69: Future Plan
Slides 70-71: (no transcript)
Slide 72: Future Directions
- Domain adaptation and inter-operability
  - High performance can be obtained by exploiting domain-specific characteristics and domain semantics
  - Differences among abstracts, full papers, and comments in DBs
  - Standardized interfaces (APIs) for NLP tools
- Text archives
  - Abstracts, full papers, and comment/summary descriptions in DBs
- Combining NLP tools with mining tools
  - Knowledge discovery (disease-gene association)
  - Hypothesis generation
  - Automatic data interpretation
Slide 73: Future Directions (repeat of slide 72)