TM and NLP for Biology: Research Issues in HPSG Parsing

1
TM and NLP for Biology: Research Issues in HPSG Parsing
  • Junichi TSUJII

Department of Computer Science, School of Information Science and Technology, University of Tokyo, Japan
School of Computer Science / National Centre for Text Mining, University of Manchester, UK
2
G-protein coupled receptor (D.L. Banville, 2006)
Before 1988: 9 papers; 1992: 256 papers; 2005: 14,000 papers (500 times more)
[Figure: growth of Medline publications, 1964 to 2005]
3
NaCTeM (www.nactem.ac.uk)
  • First such centre in the world
  • Funding: JISC, BBSRC, EPSRC
  • Consortium investment
  • Chair in TM (Prof. J. Tsujii, Univ. of Tokyo)
  • Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk, funded by the Wellcome Trust
  • Initial focus: the biomedical academic community
  • Extend services to industry
  • Extend focus to other domains (social sciences)

4
Consortium
  • Universities of Manchester, Liverpool
  • Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
  • Self-funded partners:
  • San Diego Supercomputing Center
  • University of California, Berkeley
  • University of Geneva
  • University of Tokyo
  • Strong industrial and academic support:
  • IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, ...

5
(No Transcript)
6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
NLP and TM
Linking text with knowledge
Text Mining: text as a bag of words; words as surface strings
Natural Language Processing: language as a complex system linking surface strings of characters with their meanings; text and words as structured objects
NLP-based TM
12
From surface diversities and ambiguities to conceptual invariants
Non-trivial mappings (terminology, parsing, paraphrasing) connect the language domain (linguistic expressions) with the knowledge domain (concepts and the relationships among them, motivated independently of language)
13
Example
14
"A protein activates B" (pathway extraction)
  • Since ..., we postulate that only phosphorylated PHO2 protein could activate the transcription of the PHO5 gene.
  • Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
  • Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to ...
Retrieval using regional algebra: sentence > (arg1_activate > protein)
Non-trivial mapping: the same relation is expressed with different structures in the language domain; the knowledge domain is motivated independently of language
15
Predicate-argument structure: parser based on probabilistic HPSG (Enju)
[Parse tree for "p53 has been shown to directly activate the Bcl-2 protein", with phrase labels (S, VP, NP, ADVP) and predicate-argument links arg1, arg2, arg3]
16
Semantic Retrieval System Using Deep Syntax: MEDIE
[Parse tree for "The protein is activated by it": phrase labels s, np, vp, pp; POS tags DT NN VBZ VBN IN PRP; predicate-argument links arg1, arg2, mod]
17
26
Demos
  • MEDIE
  • Info-PubMed

27
Predicate-argument structure: parser based on probabilistic HPSG (Enju)
[Parse tree for "p53 has been shown to directly activate the Bcl-2 protein", repeated from slide 15]
28
31
Performance of Semantic Parser
                           Penn Treebank   GENIA
Coverage                   99.7%           99.2%
F-value (PA relations)     87.4%           86.4%
Sentence precision         39.2%           31.8%
Processing time            0.68 sec        1.00 sec
32
Scalability of TM Tools
Target corpus: MEDLINE
  • Number of papers: 14,792,890
  • Number of abstracts: 7,434,879
  • Number of sentences: 70,815,480
  • Number of words: 1,418,949,650
  • Compressed data size: 3.2 GB
  • Uncompressed data size: 10 GB
Suppose, for example, that it takes one second to parse one sentence: 70 million seconds, that is, about 2 years.
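The slide's estimate can be checked with plain arithmetic, assuming the hypothetical round figure of one second per sentence:

```python
# Back-of-the-envelope check of the parsing-time estimate, using the
# sentence count from the slide and an assumed one second per sentence.
sentences = 70_815_480          # sentences in MEDLINE, from the slide
seconds_per_sentence = 1.0      # assumption stated on the slide

total_seconds = sentences * seconds_per_sentence
years = total_seconds / (365 * 24 * 3600)
print(f"{total_seconds:.0f} s is about {years:.1f} years")  # about 2.2 years
```

This is why the next slide turns to grid computing: a 340-CPU cluster brings the two-plus years down to days.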
33
TM and GRID
  • Solution
  • The entire MEDLINE was parsed by distributed PC clusters consisting of 340 CPUs
  • Parallel processing was managed by the grid platform GXP [Taura 2004]
  • Experiments
  • The entire MEDLINE was parsed in 8 days
  • Output
  • Syntactic parse trees and predicate-argument structures in XML format
  • The compressed/uncompressed output sizes were 42.5 GB / 260 GB

34
Efficient Parsing for HPSG
35
Background: HPSG
  • Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
  • Lexicalized and constraint-based grammar
  • A few rule schemata: general constraints on linguistic constructions
  • Constraints embedded in the lexicon: word-specific constraints
  • Constraints between phrase structures and semantic structures

36
Parsing by HPSG
[Example sentence: "I like it"]
37
Parsing by HPSG
[Assignment of lexical entries to each word]
38
Application of Rule Schema
[Head-Complement schema combines "like" with its complement "it"]
39
Application of Rule Schema
[Subject-Head schema combines the subject "I" with the VP "like it"]
40
Inefficiency of HPSG Parsing
  • Complex DAGs: typed feature structures
  • Abstract machine for unification (LiLFeS)
  • Unification is an expensive operation (→ CFG approximation, CFG filtering)
  • Assignment of lexical entries
  • Large reduction of the search space: supertagging

41
Filtering with CFG (1/5)
  • 2-phased parsing
  • Approximate the HPSG with a CFG while keeping important constraints
  • The obtained CFG may over-generate, but can be used for filtering
  • Rewriting in the CFG is far less expensive than applying rule schemata, principles, and so on

[Pipeline: the HPSG (feature structures) is compiled into a CFG; input sentences are first parsed with a built-in CFG parser, then with LiLFeS unification, producing complete parse trees as output]
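The two-phase idea can be sketched in miniature. The grammar, lexicon, and "expensive" phase below are invented toys, not Enju's compiled CFG; the point is the control flow: a cheap CFG recognizer rejects inputs early, so the costly unification phase only runs on sentences the over-generating CFG admits.

```python
# Toy 2-phase parsing sketch (hypothetical mini-grammar, not Enju's).
CFG = {  # Chomsky-normal-form rules: LHS -> set of (B, C) pairs
    "S": {("NP", "VP")},
    "VP": {("V", "NP")},
}
LEX = {"I": {"NP"}, "it": {"NP"}, "like": {"V"}}

def cfg_recognize(words):
    """Cheap CYK recognizer over the approximated CFG."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEX.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rhss in CFG.items():
                    for b, c in rhss:
                        if b in chart[i][k] and c in chart[k][j]:
                            chart[i][j].add(lhs)
    return "S" in chart[0][n]

def expensive_unification(words):
    """Stand-in for HPSG schema application; returns a mock PAS."""
    return {"pred": words[1], "arg1": words[0], "arg2": words[2]}

def parse(words):
    if not cfg_recognize(words):       # phase 1: cheap filter rejects early
        return None
    return expensive_unification(words)  # phase 2: only for admitted inputs

print(parse("I like it".split()))      # passes the filter
print(parse("like I it".split()))      # rejected by the CFG filter: None
```

Because the CFG over-generates, a sentence passing the filter can still fail unification; the filter only prunes, it never blocks a valid HPSG parse.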
42
Inefficiency of HPSG Parsing
  • Complex DAGs: typed feature structures
  • Abstract machine for unification (LiLFeS)
  • Unification is an expensive operation (→ CFG approximation, CFG filtering)
  • Assignment of lexical entries
  • Large reduction of the search space: supertagging

43
HPSG and Parsing
Most constraints are in the lexical entries (LEs): once LEs are assigned, the parsing results are largely determined implicitly
Example LE for "like": [HEAD verb, SUBJ <NP>, COMPS <NP>]
44
Supertagging and Efficient Parsing [Clark and Curran, 2004; Ninomiya et al., 2006]
  • Supertagging: P(sequence of LEs | sequence of words)
  • Selection of lexical entry assignments
  • [Bangalore and Joshi, 1999]

[Illustration over "I like it": each word receives candidate LEs such as [HEAD noun, SUBJ <>, COMPS <>] and [HEAD verb, SUBJ <NP>, COMPS <NP>]; a probability threshold keeps only the high-probability candidates]
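The thresholding step on this slide can be sketched as follows. The probability table is invented for illustration; a real supertagger estimates P(sequence of LEs | sequence of words) with a trained model.

```python
# Minimal supertagging-threshold sketch: per word, keep only lexical
# entries whose probability exceeds a cutoff. Probabilities are made up.
P_LE = {
    "I":    {"noun_le": 0.95, "verb_le": 0.05},
    "like": {"verb_le": 0.80, "prep_le": 0.15, "noun_le": 0.05},
    "it":   {"noun_le": 0.97, "verb_le": 0.03},
}

def supertag(words, threshold=0.1):
    """Return, per word, its candidate LEs above threshold, best first."""
    return {w: [le for le, p in sorted(P_LE[w].items(), key=lambda x: -x[1])
                if p >= threshold]
            for w in words}

print(supertag(["I", "like", "it"]))
# 'like' keeps verb_le and prep_le; its 0.05 noun entry is pruned
```

Pruning before chart parsing is where the large search-space reduction comes from: the parser never builds edges for the discarded entries.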
45
Chart parsing
[Chart over "I like it" with the surviving candidate LEs, e.g. [HEAD noun, SUBJ <>, COMPS <>] for "I" and "it", [HEAD verb, SUBJ <NP>, COMPS <NP>] for "like"]
46
Efficient Parser
  • Smaller number of LE assignments
  • LE assignments that lead to complete parse trees

Previous methods: 1. chart parsing using the initial LE assignment; 2. extend the LE assignment when parsing fails
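The "extend the LE assignment when parsing fails" loop can be sketched as below. The words, candidate LEs, and the success test are all invented stand-ins (the real check is whether a complete parse tree exists); only the control flow mirrors the slide.

```python
# Sketch of the backoff loop: parse with a small initial LE assignment,
# and widen it only on failure. Candidate lists are best-first and made up.
CANDIDATES = {"they": ["noun"], "duck": ["noun", "verb_intrans"]}

def try_parse(assignment):
    """Stub success test: the toy 'grammar' needs at least one verb LE."""
    verbs = sum(1 for les in assignment.values()
                for le in les if le.startswith("verb"))
    return verbs >= 1

def parse_with_backoff(words, max_k=2):
    for k in range(1, max_k + 1):              # widen assignment on failure
        assignment = {w: CANDIDATES[w][:k] for w in words}
        if try_parse(assignment):
            return k, assignment
    return None

k, a = parse_with_backoff(["they", "duck"])
print(k)   # 2: the 1-best assignment has no verb, so it is widened once
```

The staged/deterministic model on the next slide avoids re-parsing from scratch at each widening, which is where its speedup comes from.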
47
System Overview
Input sentence: "I like it"
[Pipeline components: CFG filtering, supertagger, deterministic shift/reduce parser]
[Candidate LEs per word, e.g. [HEAD noun, SUBJ <>, COMPS <>] and [HEAD verb, SUBJ <NP>, COMPS <NP>]; only high-probability candidates are passed on]
48
Experiment Results
                                                  LP(%)   LR(%)   F1(%)   Avg. time
Staged/deterministic model                        86.93   86.47   86.70    30 ms/snt
Previous method 1 (supertagger + chart parser)    87.35   86.29   86.81   183 ms/snt
Previous method 2 (unigram + chart parser)        84.96   84.25   84.60   674 ms/snt
6 times and 20 times faster, respectively, than the previous methods
49
Domain/Text Type Adaptation
50
                                       F-score   Training time (sec)
Baseline (PTB-trained, PTB-applied)     89.81          0
Baseline (PTB-trained, GENIA-applied)   86.39          0
Retraining (GENIA)                      88.45     14,695
Retraining (PTB+GENIA)                  89.94    238,576
Structure with RefDist                  88.18     21,833
Lexical with RefDist                    89.04     12,957
Lex/Structure with RefDist              90.15     31,637
51
Adaptation with Reference Distribution
[Model diagram: the original (PTB-trained) model serves as a reference distribution; feature functions and feature weights for lexical assignment and syntactic preference are retrained on the new domain]
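The combination on this slide can be written as a log-linear model with a reference distribution (a sketch; the notation below is assumed, not taken from the slide). The original PTB-trained model $p_0$ is kept fixed, and only the new feature weights $\lambda_i$ are estimated on GENIA:

```latex
p(t \mid s) = \frac{1}{Z(s)}\, p_0(t \mid s)\,
  \exp\!\Big(\sum_i \lambda_i f_i(t, s)\Big),
\qquad
Z(s) = \sum_{t'} p_0(t' \mid s)\,
  \exp\!\Big(\sum_i \lambda_i f_i(t', s)\Big)
```

where $s$ is a sentence and $t$ a candidate parse. Since only the in-domain features are re-estimated, training is far cheaper than full retraining, consistent with the table above (12,957 to 31,637 sec vs. 238,576 sec for PTB+GENIA retraining).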
52
[Figure: F-score (y-axis, 83 to 90) vs. number of sentences in the GENIA training set (x-axis, 0 to 8,000), comparing Baseline (PTB), Simple Retraining (GENIA), Retraining (GENIA+PTB), Structure with RefDist, Lexical with RefDist, and Lexical/Structure with RefDist]
53
[Figure: F-score (y-axis, 83 to 90) vs. training time (x-axis, 0 to 30,000 sec), comparing Structure with RefDist, Lexicon with RefDist, and Lex/Str with RefDist]
54
                                       F-score   Training time (sec)
Baseline (PTB-trained, PTB-applied)     89.81          0
Baseline (PTB-trained, GENIA-applied)   86.39          0
Retraining (GENIA)                      88.45     14,695
Retraining (PTB+GENIA)                  89.94    238,576
Structure with RefDist                  88.18     21,833
Lexical with RefDist                    89.04     12,957
Lex/Structure with RefDist              90.15     31,637
55
Tool 1: POS Tagger
Example: "The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes"
Tags:     DT NN NN NN VBZ JJ NN NN NN CD NN NN IN NNS
  • General-purpose POS taggers, trained on WSJ
  • Brill's tagger, TnT tagger, MXPOST, etc.
  • About 97% accuracy
  • General-purpose POS taggers do not work well for MEDLINE abstracts
56
Errors seen in the TnT tagger (Brants 2000)
[Examples of mis-tagged MEDLINE phrases, with assigned tags:]
  • "A chromosomal translocation in" (DT JJ NN IN)
  • "and membrane potential after mitogen binding." (CC NN NN IN NN JJ)
  • "two factors, which bind to the same kappa B enhancers" (CD NNS WDT NN TO DT JJ NN NN NNS)
  • "by analysing the Ag amino acid sequence." (IN VBG DT VBG JJ NN NN)
  • "to contain more T-cell determinants than" (TO VB RBR JJ NNS IN)
  • "Stimulation of interferon beta gene transcription in vitro by" (NN IN JJ JJ NN NN IN NN IN)
57
Performance of GENIA Tagger
GENIA tagger (accuracy by training corpus, rows, and test corpus, columns):
Training \ Test    WSJ     GENIA
WSJ                97.0    84.3
GENIA              75.2    98.1
WSJ+GENIA          96.9    98.1

(Ref.) TnT tagger:
Training \ Test    WSJ     GENIA
WSJ                96.7    84.3
GENIA              80.1    97.9
WSJ+GENIA          96.5    97.5

Some degradations (0.2 to 0.4) were observed compared with the taggers trained on pure corpora; no degradation for the tagger trained on the mixed corpus.
58
CRF-based POS Tagging with Active Learning: GENIA
3,000 sentences: 98.4%;  20,000 sentences: 98.58%
59
CRF-based POS Tagging with Active Learning: PTB
10,000 sentences: 96.76%;  best performance: 97.18%
60
Applications
61
Our Policy for Information Extraction
  • Separate the domain/task-independent part from the domain/task-specific part

[Diagram: an IE system split into a task-independent part and a task-specific part]
62
Our Policy
  • Separate the domain/task-independent part from the domain/task-specific part

PAS: predicate-argument structure
[Diagram: the task-independent part is a full parser that normalizes sentences into PASs; the task-specific part is a set of extraction rules over PASs]
63
Our Policy
  • Distinguish the domain-independent part from the domain-specific part

PAS: predicate-argument structure
[Diagram: a full parser normalizes sentences into PASs (task-independent); extraction rules over PASs (task-specific) are learned automatically from a corpus]
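The division of labour above can be sketched in miniature. The PAS, entity typer, and rule below are invented for illustration; the point is that the task-specific rule is stated over normalized predicate-argument structures, not surface strings.

```python
# Toy sketch: a task-specific extraction rule over a PAS produced by the
# (task-independent) parser. PAS contents and entity types are made up.
PAS = {"pred": "activate", "arg1": "PHO2 protein", "arg2": "PHO5 gene"}

ENTITY_TYPE = {"PHO2 protein": "Protein", "PHO5 gene": "Gene"}

def rule_activation(pas):
    """Task-specific rule: Protein activates Gene -> regulation tuple."""
    if (pas["pred"] == "activate"
            and ENTITY_TYPE.get(pas["arg1"]) == "Protein"
            and ENTITY_TYPE.get(pas["arg2"]) == "Gene"):
        return ("regulates", pas["arg1"], pas["arg2"])
    return None

print(rule_activation(PAS))   # ('regulates', 'PHO2 protein', 'PHO5 gene')
```

Because the parser normalizes active, passive, and relative-clause variants into the same PAS, one rule covers "PHO2 activates PHO5", "PHO5 is activated by PHO2", and so on.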
64
GENIA Event Annotation: Example
For an identified event in the given sentence:
  • classify the type of the event and record the text span giving the clue for it (ClueType)
  • identify the theme of the event and record the text span linking the theme to the event (LinkTheme)
  • identify the cause of the event and record the text span linking the cause to the event (LinkCause)
  • record the environment (location, time) of the event (ClueLoc, ClueTime)

65
Gene_expression
  • Theme patterns observed (2,958):
  • Protein 2,308
  • DNA 591
  • RNA 25
  • Peptide 4
  • Protein + Protein 2
  • Erroneous 27
  • Keywords:
  • coexpress, nonexpress, overexpress, express, coexpression, biosynthesis, product, synthesize, constitute, ...
66
Transcription
  • Theme patterns observed (929)
  • DNA 449
  • RNA 272
  • Protein 167
  • Peptide 2
  • Erroneous 22
  • Keywords (stems)
  • transcrib-, transcript, synthesi-, express-, ...

67
Localization
  • ClueLoc
  • NONE 241
  • nuclear 140
  • to the nucleus 12
  • into the nucleus 11
  • Cytoplasmic 8
  • in the cytoplasm 7
  • macrophages 5
  • nuclear in t lymphocytes 4
  • monocytes 4
  • in the nucleus 4
  • in the cytosol 4
  • in colostrum 4
  • from the cytoplasm to the nucleus 4
  • Theme patterns observed (730)
  • Protein 608
  • Lipid 31
  • Atom 29
  • Other_organic_compound 14
  • DNA 12
  • Virus 5
  • Carbohydrate 5
  • RNA 4
  • Inorganic 4
  • Peptide 3
  • Keywords
  • translocation, secretion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, migrate, localisation, move, delivery, export, ...
68
Localization
  • Keywords and Locations
  • translocation (166)
  • nuclear 108
  • NONE 38
  • secretion (100)
  • NONE 57
  • name_of_cells 43
  • release (80)
  • NONE 51
  • name_of_cells 19
  • localization (30)
  • nuclear 25
  • intracellular 3
  • uptake (24)
  • NONE 14
  • name_of_cells 20
  • Keywords and Themes
  • translocation (166)
  • Protein 161
  • Virus 4
  • RNA 1
  • secretion (100)
  • Protein 98
  • Lipid 1
  • Peptide 1
  • release (80)
  • Protein 67
  • Other_organic_compound 6
  • Lipid 3
  • localization (30)
  • Protein 30
  • uptake (24)
  • Lipid 15
  • Carbohydrate 5
  • Protein 4

69
Future Plan
70
(No Transcript)
71
(No Transcript)
72
Future Directions
  • Domain adaptation and inter-operability
  • High performance can be obtained by using domain-specific characteristics and domain semantics
  • Differences among abstracts, full papers, and comments in DBs
  • Standardized interfaces (APIs) for NLP tools
  • Text archives
  • Abstracts, full papers, comments/summary descriptions in DBs
  • Combining NLP tools with mining tools
  • Knowledge discovery (disease-gene association)
  • Hypothesis generation
  • Automatic data interpretation
