Title: Learning to Extract Proteins and their Interactions from Medline Abstracts
1Learning to Extract Proteins and their
Interactions from Medline Abstracts
Raymond J. Mooney, Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Yuk Wah Wong
Department of Computer Sciences
Edward M. Marcotte, Arun Ramani
Institute for Cellular and Molecular Biology
University of Texas at Austin
2Biological Motivation
- Human Genome Project has produced huge amounts of genetic data.
- Next step is analyzing and interpreting this data.
4Starting at the tip of chromosome 1...
1 taaccctaac cctaacccta accctaaccc
taaccctaac cctaacccta accctaaccc 61
taaccctaac cctaacccta accctaaccc taaccctaac
cctaacccaa ccctaaccct 121 aaccctaacc
ctaaccctaa ccctaacccc taaccctaac cctaacccta
accctaacct 181 aaccctaacc ctaaccctaa
ccctaaccct aaccctaacc ctaaccctaa cccctaaccc
241 taaccctaaa ccctaaaccc taaccctaac cctaacccta
accctaaccc caaccccaac 301 cccaacccca
accccaaccc caaccctaac ccctaaccct aaccctaacc
ctaccctaac 361 cctaacccta accctaaccc
taaccctaac ccctaacccc taaccctaac cctaacccta
421 accctaaccc taaccctaac ccctaaccct aaccctaacc
ctaaccctcg cggtaccctc 481 agccggcccg
cccgcccggg tctgacctga ggagaactgt gctccgcctt
cagagtacca 541 ccgaaatctg tgcagaggac
aacgcagctc cgccctcgcg gtgctctccg ggtctgtgct
601 gaggagaacg caactccgcc ggcgcaggcg cagagaggcg
cgccgcgccg gcgcaggcgc 661 agacacatgc
tagcgcgtcg gggtggaggc gtggcgcagg cgcagagagg
cgcgccgcgc 721 cggcgcaggc gcagagacac
atgctaccgc gtccaggggt ggaggcgtgg cgcaggcgca
781 gagaggcgca ccgcgccggc gcaggcgcag agacacatgc
tagcgcgtcc aggggtggag 841 gcgtggcgca
ggcgcagaga cgcaagccta cgggcggggg ttgggggggc
gtgtgttgca 901 ggagcaaagt cgcacggcgc
cgggctgggg cggggggagg gtggcgccgt gcacgcgcag
961 aaactcacgt cacggtggcg cggcgcagag acgggtagaa
cctcagtaat ccgaaaagcc 1021 gggatcgacc
gccccttgct tgcagccggg cactacagga cccgcttgct
cacggtgctg 1081 tgccagggcg ccccctgctg
gcgactaggg caactgcagg gctctcttgc ttagagtggt ...
5641 gctccagggc ccgctcacct tgctcctgct
ccttctgctg ctgcttctcc agctttcgct 5701
ccttcatgct gcgcagcttg gccttgccga tgcccccagc
ttggcggatg gactctagca 5761 gagtggccag
ccaccggagg ggtcaaccac ttccctggga gctccctgga
ctggagccgg 5821 gaggtgggga acagggcaag
gaggaaaggc tgctcaggca gggctgggga agcttactgt
5881 gtccaagagc ctgctgggag ggaagtcacc tcccctcaaa
cgaggagccc tgcgctgggg 5941 aggccggacc
tttggagact gtgtgtgggg gcctgggcac tgacttctgc
aaccacctga 6001 gcgcgggcat cctgtgtgca
gatactccct gcttcctctc tagcccccac cctgcagagc
6061 tggacccctg agctagccat gctctgacag tctcagttgc
acacacgagc cagcagaggg 6121 gttttgtgcc
acttctggat gctagggtta cactgggaga cacagcagtg
aagctgaaat 6181 gaaaaatgtg ttgctgtagt
ttgttattag accccttctt tccattggtt taattaggaa
6241 tggggaaccc agagcctcac ttgttcaggc tccctctgcc
ctagaagtga gaagtccaga 6301 gctctacagt
ttgaaaacca ctattttatg aaccaagtag aacaagatat
ttgaaatgga 6361 aactattcaa aaaattgaga
atttctgacc acttaacaaa cccacagaaa atccacccga
6421 gtgcactgag cacgccagaa atcaggtggc ctcaaagagc
tgctcccacc tgaaggagac 6481 gcgctgctgc
tgctgtcgtc ctgcctggcg ccttggccta caggggccgc
ggttgagggt 6541 gggagtgggg gtgcactggc
cagcacctca ggagctgggg gtggtggtgg gggcggtggg
6601 ggtggtgtta gtaccccatc ttgtaggtct gaaacacaaa
gtgtggggtg tctagggaag ... and 3 x 10^9 more ...
5Proteomics 101
- Genes code for proteins.
- Proteins are the basic components of biological machinery.
- Proteins accomplish their functions by interacting with other proteins.
- Knowledge of protein interactions is fundamental to understanding gene function.
- Chains of interactions compose large, complex gene networks.
6Sample Gene Network
7Yeast Gene Network
Yeast
5,800 genes
5,800 proteins x 2-10 interactions/protein
12,000 - 60,000 interactions
10-20,000 known => 1/3 of the way to a complete map!
8Human Gene Network
40,000 genes
>>40,000 proteins x 2-10 interactions/protein
>>80,000 - 400,000 interactions
<5,000 known => approx. 1% of the complete map!
=> We're a long way from the complete map
9Relevant Sources of Data
Biological literature: 14 million documents
DNA sequence data: 10^10 nucleotides
Gene expression data: 10^8 measurements, but...
DNA polymorphisms: 10^7 known
Gene inactivation (knockout) studies: 10^5
Protein structure data: 10^4 structures
Protein interaction data: 10^4 interactions, but...
Protein expression data: 10^4 measurements, but...
Protein location data: 10^4 measurements
10Knowledge in Biomedical Literature
- An ever-increasing wealth of biological information is present in millions of published articles, but retrieving it in structured form is difficult.
- Much of this literature is available through the NIH-NLM's Medline (PubMed) repository.
- 11 million abstracts in electronic form are available through Medline.
- Excellent source of information on protein interactions.
11Obtaining Protein Interactions from Medline
- Reactome, BIND, HPRD: databases with protein interactions manually curated from Medline.
- Many interactions from Medline are not covered by current databases.
- Need automated information extraction to easily locate and structure this information.
We integrated these databases, removing duplicates
12Sample Medline Abstract
TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein
AB - Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene. Other recent studies
have identified human cyclin D1 (PRAD1) as a
putative G1 cyclin and candidate
proto-oncogene. However, the specific enzymatic
activities and, hence, the precise biochemical
mechanisms through which cyclins function to
govern cell cycle progression remain
unresolved. In the present study we have
investigated the coordinate interactions between
these two potentially oncogenic cyclins,
cyclin-dependent protein kinase subunits (cdks)
and the Rb tumor-suppressor protein. The
distribution of cyclin D isoforms was modulated
by serum factors in primary fetal rat lung
epithelial cells. Moreover, cyclin D1 was found
to be phosphorylated on tyrosine residues in vivo
and, like cyclin A, was readily phosphorylated by
pp60c-src in vitro. In synchronized human
osteosarcoma cells, cyclin D1 is induced in early
G1 and becomes associated with p9Ckshs1, a
Cdk-binding subunit. Immunoprecipitation
experiments with human osteosarcoma cells and
Ewing's sarcoma cells demonstrated that cyclin D1
is associated with both p34cdc2 and p33cdk2, and
that cyclin D1 immune complexes exhibit
appreciable histone H1 kinase activity. Immobilized,
recombinant cyclins A and D1 were found to
associate with cellular proteins in complexes
that contain the p105Rb protein. This study
identifies several common aspects of cyclin
biochemistry, including tyrosine phosphorylation
and the potential to interact directly or
indirectly with the Rb protein, that may
ultimately relate membrane-mediated signaling
events to the regulation of gene expression.
13Sample Medline Abstract
14Sample Medline Abstract
15Manually Developed IE Systems for Medline
- A number of projects have focused on the manual development of information extraction (IE) systems for biomedical literature.
- KeX for extracting protein names (Fukuda et al., 1998):
  Extract words with special symbols, excluding those with more than half of the characters being special symbols, hence eliminating strings such as /-.
- Suiseki for extracting protein interactions (Blaschke et al., 2001):
  PROT (0-2) PROT (0-2) complex NOUN between (0-3) PROT (0-3) and (0-3) PROT
16Learning Information Extractors
- Manually developing IE systems is tedious and time-consuming, and such systems do not capture all possible formats and contexts for the desired information.
- Machine learning from supervised corpora is becoming the standard approach to building information extractors.
- Recently, several learning approaches have been applied to Medline extraction (Craven & Kumlien, 1999; Tanabe & Wilbur, 2002; Raychaudhuri et al., 2002).
- We have explored the use of a variety of machine learning techniques to develop IE systems for extracting human protein names and interactions, presenting uniform results on a single, reasonably large, human-annotated corpus.
17Framework for Interaction Extraction
- Traditionally, the task has two separate steps: Protein Extraction and Interaction Extraction.
Medline abstract
- Extensive comparative experiments in Bunescu et al. 2005.
- Protein Extraction: Maximum Entropy tagger.
- Interaction Extraction: ELCS (Extraction using Longest Common Subsequences).
18Non-Learning Protein Extractors
- Dictionary-based extraction
- KEX (Fukuda et al., 1998)
19Learning Methods for Protein Extraction
- Rule-based pattern induction
  - Rapier (Califf & Mooney, 1999)
  - BWI (Freitag & Kushmerick, 2000)
- Token classification (chunking approach)
  - K-nearest neighbor
  - Transformation-Based Learning: Abgene (Tanabe & Wilbur, 2002)
  - Support Vector Machine
  - Maximum entropy
- Hidden Markov Models
- Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001)
- Relational Markov Networks (Taskar, Abbeel, and Koller, 2002)
20Name Extraction by Token Classification (Chunking
or Sequence Labeling Approach)
- Since in our data no protein names directly abut each other, we can reduce the extraction problem to classification of individual words as being part of a protein name or not.
- Protein names are extracted by identifying the longest sequences of words classified as being part of a protein name.
Two potentially oncogenic cyclins , cyclin A and
cyclin D1 , share common properties of subunit
configuration , tyrosine phosphorylation and
physical association with the Rb protein
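The reduction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `extract_names` and the hand-assigned labels are assumptions, and a real system would get the labels from a trained token classifier.

```python
from typing import List, Tuple

def extract_names(tokens: List[str], labels: List[bool]) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each maximal run of tokens labeled True."""
    names = []
    start = None
    for i, is_protein in enumerate(labels):
        if is_protein and start is None:
            start = i                      # a new protein name begins here
        elif not is_protein and start is not None:
            names.append((start, i, " ".join(tokens[start:i])))
            start = None                   # the current name has ended
    if start is not None:                  # name running to the end of the sentence
        names.append((start, len(tokens), " ".join(tokens[start:])))
    return names

tokens = "Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share".split()
protein_positions = {5, 6, 8, 9}           # hypothetical classifier output
labels = [i in protein_positions for i in range(len(tokens))]
names = extract_names(tokens, labels)
print(names)  # [(5, 7, 'cyclin A'), (8, 10, 'cyclin D1')]
```

Because no two names abut in the data, each maximal run of positively labeled tokens corresponds to exactly one name.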
21Constructing Feature Vectors for Classification
- For each token, we take the following as features:
  - Current token
  - Last 2 tokens and next 2 tokens
  - Output of dictionary-based tagger for these 5 tokens
  - Suffix for each of the 5 tokens (last 1, 2, and 3 characters)
  - Class labels for last 2 tokens
Two potentially oncogenic cyclins , cyclin A and
cyclin D1 , share common properties of subunit
configuration , tyrosine phosphorylation and
physical association with the Rb protein
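A feature vector of this kind might be assembled as in the sketch below. Several pieces are assumptions for illustration only: the dictionary-based tagger is reduced to a lowercase set lookup, the `<PAD>` and `<START>` placeholders are hypothetical, and the feature names are invented here.

```python
from typing import Dict, List, Set

PAD = "<PAD>"      # hypothetical placeholder for positions outside the sentence
START = "<START>"  # hypothetical placeholder when fewer than 2 previous labels exist

def token_features(tokens: List[str], i: int, prev_labels: List[str],
                   dictionary: Set[str]) -> Dict[str, object]:
    """Features for token i: a 5-token window, dictionary hits, suffixes, last 2 labels."""
    feats: Dict[str, object] = {}
    for offset in range(-2, 3):            # last 2 tokens, current token, next 2 tokens
        j = i + offset
        tok = tokens[j] if 0 <= j < len(tokens) else PAD
        feats[f"tok[{offset}]"] = tok
        # Stand-in for the dictionary-based tagger: a simple set lookup.
        feats[f"dict[{offset}]"] = tok.lower() in dictionary
        for n in (1, 2, 3):                # suffixes of length 1, 2, and 3
            feats[f"suffix{n}[{offset}]"] = tok[-n:]
    # prev_labels holds the labels already assigned to tokens 0..i-1.
    feats["label[-1]"] = prev_labels[i - 1] if i >= 1 else START
    feats["label[-2]"] = prev_labels[i - 2] if i >= 2 else START
    return feats

tokens = "physical association with the Rb protein".split()
feats = token_features(tokens, 4, ["O", "O", "O", "O"], {"rb"})
```

Each token yields 27 features here: 5 window positions times 5 features each, plus the 2 previous class labels.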
22Our Biomedical Corpora (AIMed)
- 750 abstracts that contain the word "human" were randomly chosen from Medline for testing protein name extraction. They contain a total of 5,206 protein references.
- 200 abstracts previously known to contain protein interactions were obtained from the Database of Interacting Proteins. They contain 1,101 interactions and 4,141 protein names.
- As negative examples for interaction extraction are rare, an extra set of 30 abstracts containing sentences with non-interacting proteins is included.
- The resulting 230 abstracts are used for testing protein interaction extraction.
23The Yapex Corpus
- 200 abstracts from Medline, manually tagged for protein names.
- 147 randomly chosen such that they contain the MeSH terms protein binding, interaction, molecular.
- 53 randomly chosen from the GENIA corpus.
http://www.sics.se/humle/projects/prothalt/
24Evaluation Metrics for Information Extraction
- Precision is the percentage of extracted items that are correct.
- Recall is the percentage of correct items that are extracted.
- Extracted protein names are considered correct if the same character sequences have been human-tagged as protein names in the exact positions.
- Extracted protein interactions from an abstract are considered correct if both proteins have been human-tagged as interacting in that abstract. Positions are not taken into account.
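The two correctness criteria can be made concrete with sets. In this sketch (an illustration under assumed data structures, not the evaluation code used in the experiments), a protein name is a (start, end, text) tuple so that position matters, while an interaction is an unordered pair of names so that position and order do not.

```python
from typing import Set, Tuple

def precision_recall(extracted: Set, gold: Set) -> Tuple[float, float]:
    """Precision and recall of extracted items against human-tagged items."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Protein names: exact character sequence in the exact position (made-up example data).
gold_names = {(5, 7, "cyclin A"), (8, 10, "cyclin D1")}
found_names = {(5, 7, "cyclin A"), (8, 9, "cyclin")}     # second name is truncated
name_scores = precision_recall(found_names, gold_names)

# Interactions: unordered protein pairs within one abstract, positions ignored.
gold_pairs = {frozenset({"cyclin D1", "p34cdc2"}), frozenset({"cyclin D1", "p33cdk2"})}
found_pairs = {frozenset({"p34cdc2", "cyclin D1"})}
pair_scores = precision_recall(found_pairs, gold_pairs)

print(name_scores, pair_scores)  # (0.5, 0.5) (1.0, 0.5)
```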
25Experimental Method
- 10-fold cross-validation: average results over 10 trials with different training and (independent) test data.
- For methods which produce confidence in extractions, vary the threshold for extraction in order to explore the recall-precision trade-off.
- Use standard methods from information retrieval to generate a complete precision-recall curve.
- Maximizing F-measure assumes a particular cost-benefit trade-off between incorrect and missed extractions.
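One common way to trace the recall-precision trade-off is to rank extractions by confidence and record precision and recall at each cut-off. The sketch below assumes distinct confidence values and uses made-up scores; the function names are illustrative, and the balanced F-measure (harmonic mean) is only one possible weighting of the two error types.

```python
from typing import List, Tuple

def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pr_curve(scored: List[Tuple[float, bool]], num_gold: int) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each confidence cut-off."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    points = []
    correct = 0
    for rank, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        # Keeping every extraction with confidence >= this value gives `rank` items.
        points.append((confidence, correct / rank, correct / num_gold))
    return points

# (confidence, is_correct) for each extraction; 4 gold items exist in total.
scored = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
points = pr_curve(scored, num_gold=4)
best = max(points, key=lambda p: f_measure(p[1], p[2]))
```

In this toy run the F-measure peaks at the 0.7 threshold, where precision is 2/3 and recall is 0.5.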
26Protein Name Extraction Results (Bunescu et al., 2004)
27Graphical Models
Probabilistic models that represent dependencies
using a graph
- Directed Models => well suited to represent temporal and causal relationships (Bayesian Networks, HMMs)
- Undirected Models => appropriate for representing statistical correlation between variables (Markov Networks)
- Generative Models => define a joint probability over observations and labels (HMMs)
- Discriminative Models => specify a probability over labels given a set of observations (Conditional Random Fields)
28Conditional Random Fields (Lafferty, McCallum & Pereira 2001)
- CRFs are a type of discriminative Markov network used for labeling sequences.
- CRFs have shown superior or competitive performance in various tasks, such as:
  - Shallow Parsing (Sha & Pereira 2003)
  - Entity Recognition (McCallum & Li 2003)
  - Table Extraction (Pinto et al 2003)
29Conditional Random Fields (CRFs)
- Tj.tag: the tag at position j
- Tj.w: true if word w occurs at position j
- Tj.cap: true if the word at position j begins with a capital letter,
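For reference, the standard linear-chain CRF of Lafferty et al. (2001) can be written as below. This is the textbook form, not a formula recovered from the slide; the feature index k, the weights \lambda_k, and the example feature f_cap are illustrative assumptions built from the Tj.tag and Tj.cap properties listed above.

```latex
% Conditional probability of a tag sequence t = (t_1, ..., t_n) given observations o
P(t \mid o) = \frac{1}{Z(o)} \exp\Big( \sum_{j=1}^{n} \sum_{k} \lambda_k \, f_k(t_{j-1}, t_j, o, j) \Big)

% Normalization over all possible tag sequences t'
Z(o) = \sum_{t'} \exp\Big( \sum_{j=1}^{n} \sum_{k} \lambda_k \, f_k(t'_{j-1}, t'_j, o, j) \Big)

% Example binary feature: the token at position j is tagged as protein and is capitalized
f_{\mathrm{cap}}(t_{j-1}, t_j, o, j) =
  \begin{cases}
    1 & \text{if } T_j.\mathrm{tag} = \text{protein and } T_j.\mathrm{cap} \\
    0 & \text{otherwise}
  \end{cases}
```

Training fits the weights \lambda_k so that human-tagged sequences receive high conditional probability.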
30Protein Name Extraction Results (Yapex)
31Collective Classification of Web Pages
Taskar, Abbeel & Koller 2002
32Collective Information Extraction
The control of human ribosomal protein L22 (
rpL22 ) to enter into the nucleolus and its
ability to be assembled into the ribosome is
regulated by its sequence . The nuclear import of
rpL22 depends on a classical nuclear
localization signal of four lysines at positions
13 - 16 . Once it reaches the nucleolus , the
question of whether rpL22 is assembled into the
ribosome depends upon the presence of the N -
domain .
(Figure: candidate entities e1-e5 from the passage above, including "ribosomal protein L22", "( rpL22 )", "L22", "of rpL22 depends", and "whether rpL22 is", connected by edges labeled acronym, overlap, and repetition.)
33Relational Markov Networks
Taskar, Abbeel & Koller 2002
Discriminative Markov Networks, augmented with
clique templates
34Experimental Results
- Compared three approaches:
  - LTRMN: RMN extraction using local templates and the Overlap Template.
  - GLTRMN: RMN extraction using both local and global templates.
  - CRF: extraction as token classification using Conditional Random Fields.
35Experimental Results Yapex
36Experimental Results AIMed
37Protein Interaction Extraction
- Most IE methods focus on extracting individual entities.
- Protein interaction extraction requires extracting relations between entities.
- Our current results on relation extraction have focused on rule-based and kernel-based learning approaches.
38ELCS (Extraction using Longest Common
Subsequences)
- A new method for inducing rules that extract interactions between previously tagged proteins.
- Each rule consists of a sequence of words with allowable word gaps between them (similar to Blaschke & Valencia, 2001, 2002).
  (7) interactions (0) between (5) PROT (9) PROT (17) .
- Any pair of proteins in a sentence, if tagged as interacting, forms a positive example; otherwise it forms a negative example.
- Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples.
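Matching a rule of this shape against a sentence can be sketched as follows. The representation is an assumption made for illustration: a rule is a list of (max_gap, word) pairs, where max_gap is the largest number of tokens that may be skipped before the word, and the sentence is not required to end where the rule ends. The example sentence is the Id3 / ITF-2b title with the two protein names replaced by PROT by hand.

```python
from typing import List, Tuple

Rule = List[Tuple[int, str]]   # (maximum number of skipped tokens, required word)

def matches(rule: Rule, tokens: List[str], pos: int = 0) -> bool:
    """True if the rule's words occur in order from `pos`, within the allowed gaps."""
    if not rule:
        return True
    max_gap, word = rule[0]
    last = min(pos + max_gap, len(tokens) - 1)
    for p in range(pos, last + 1):         # try every position inside the gap
        if tokens[p] == word and matches(rule[1:], tokens, p + 1):
            return True
    return False

# The rule "(7) interactions (0) between (5) PROT (9) PROT (17) ."
rule: Rule = [(7, "interactions"), (0, "between"), (5, "PROT"), (9, "PROT"), (17, ".")]

positive = ("Title - Physical and functional interactions between "
            "the transcriptional inhibitors PROT and PROT .").split()
negative = "PROT binds to PROT .".split()

print(matches(rule, positive), matches(rule, negative))  # True False
```

The recursion backtracks over alternative positions for each word, so a rule still matches when an early occurrence of a word leads to a dead end but a later one does not.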
39Generalizing Rules using Longest Common
Subsequence
The self - association site appears to be formed by interactions between helices 1 and 2 of beta spectrin repeat 17 of one dimer with helix 3 of alpha spectrin repeat 1 of the other dimer to form two combined alpha - beta triple - helical segments .
Title - Physical and functional interactions between the transcriptional inhibitors Id3 and ITF-2b .
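The generalization step rests on the longest common subsequence of two token sequences, which can be computed with the standard dynamic program. The sketch below shows only that core step on the two example sentences; how ELCS derives the word-gap limits and handles the tagged protein pair is not shown on the slide and is not reproduced here.

```python
from typing import List

def lcs(a: List[str], b: List[str]) -> List[str]:
    """Longest common subsequence of two token lists (standard dynamic program)."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

s1 = ("The self - association site appears to be formed by interactions between "
      "helices 1 and 2 of beta spectrin repeat 17 of one dimer with helix 3 of "
      "alpha spectrin repeat 1 of the other dimer to form two combined alpha - "
      "beta triple - helical segments .").split()
s2 = ("Title - Physical and functional interactions between the transcriptional "
      "inhibitors Id3 and ITF-2b .").split()
common = lcs(s1, s2)
```

For these two sentences the common subsequence contains "interactions between", which is the kind of shared context a generalized rule keeps.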
40Protein Interaction Extraction Results (gold-standard protein tags)
41Protein Interaction Extraction Results (automated protein tags)
42Large-Scale Text Mining
- Apply trained extractors to 753,459 Medline abstracts that reference "human".
- Automatically mine large numbers of protein interactions from this scientific text.
- Integrate extracted data with existing databases to construct the world's largest database of human protein interactions.
- How to judge the accuracy of extracted interactions?
43Accuracy Benchmark Shared Functional Annotations
- Accuracy of a protein interaction dataset correlates well with the percentage of interaction partners sharing functional annotations.
- Functional annotation = a pathway between the two proteins in a particular ontology:
  - KEGG: 55 pathways at lowest level.
  - GO: 1356 pathways at level 8 of biological process annotation.
44Accuracy Benchmarks LLR Scoring Scheme
- Use the log-likelihood ratio (LLR) of protein pairs:
LLR = ln( P(D|I) / P(D|¬I) )
P(D|I) and P(D|¬I) are the probabilities of observing the interaction data D conditioned on the proteins sharing (I) or not sharing (¬I) functional annotations.
- Higher values for LLR indicate higher accuracy.
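As a rough sketch, the log-likelihood ratio ln( P(D|I) / P(D|¬I) ) could be estimated from counts as below. The estimation scheme is an assumption, not taken from the slide: P(D|I) is approximated by the fraction of annotation-sharing pairs that appear in the dataset, and P(D|¬I) by the fraction of non-sharing pairs that appear in it. The counts are made up.

```python
import math

def llr(shared_in_data: int, shared_total: int,
        unshared_in_data: int, unshared_total: int) -> float:
    """Log-likelihood ratio ln(P(D|I) / P(D|not I)) estimated from pair counts."""
    p_d_given_i = shared_in_data / shared_total            # estimate of P(D | I)
    p_d_given_not_i = unshared_in_data / unshared_total    # estimate of P(D | not I)
    return math.log(p_d_given_i / p_d_given_not_i)

# Hypothetical counts: 50 of 1,000 annotation-sharing pairs appear in the dataset,
# versus 10 of 10,000 non-sharing pairs.
score = llr(50, 1000, 10, 10000)
```

A dataset that picks up annotation-sharing pairs 50 times more often than non-sharing pairs scores ln(50), roughly 3.9; a dataset no better than chance scores 0.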
45Interaction Extraction using Co-citation Analysis
- Intuition: proteins co-occurring in a large number of abstracts tend to be interacting proteins.
- Compute the probability of co-citation under a random model (hyper-geometric distribution).
N = total number of abstracts (750K)
n = abstracts citing the first protein
m = abstracts citing the second protein
k = abstracts citing both proteins
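With these four counts, the hypergeometric probability of seeing exactly k co-citing abstracts by chance is C(n, k) * C(N - n, m - k) / C(N, m). The sketch below computes that point probability; whether the original analysis used the point probability or a tail sum is not stated on the slide, so treat this choice, the function name, and the toy counts as assumptions.

```python
from math import comb

def cocitation_prob(N: int, n: int, m: int, k: int) -> float:
    """Hypergeometric probability that exactly k of the m abstracts citing the
    second protein fall among the n abstracts citing the first, out of N total."""
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)

# Toy numbers: 10 abstracts in total, 4 cite protein A, 3 cite protein B.
p_two = cocitation_prob(10, 4, 3, 2)     # both proteins cited together in 2 abstracts
p_three = cocitation_prob(10, 4, 3, 3)   # cited together in all 3

print(p_two)  # 0.3
```

Python integers are arbitrary precision, so the same function works for N around 750,000 when n and m are in the hundreds, although very large m makes the binomial coefficients slow to compute.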
46Interaction Extraction using Co-citation Analysis
- Protein pairs which co-occur in a large number of abstracts (high k) are assigned a low probability under the random model.
- Empirically, protein pairs whose observed co-citation rate is given a low probability under the random model score high on the functional annotation benchmark.
- RESULT: Close to 15K interactions extracted that score comparably to or better than HPRD on the functional annotation benchmark.
47Co-citation Analysis with Bayesian Reranking
- Use a trained Naïve Bayes model to measure the likelihood that an abstract discusses physical protein interactions.
- For a given pair of proteins, compute the average score of co-citing abstracts.
- Use the average score to re-rank the 15K already extracted pairs.
Medline abstract
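The averaging and re-ranking step can be sketched as follows, assuming the per-abstract scores have already been produced by some trained classifier. The abstract identifiers, protein names, and scores below are hypothetical, and the classifier itself is not shown.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def rerank(cociting: Dict[Pair, List[str]],
           score: Dict[str, float]) -> List[Tuple[Pair, float]]:
    """Rank protein pairs by the average score of the abstracts that cite both."""
    averaged = [
        (pair, sum(score[a] for a in ids) / len(ids))
        for pair, ids in cociting.items()
        if ids                               # skip pairs with no co-citing abstracts
    ]
    return sorted(averaged, key=lambda item: item[1], reverse=True)

# Hypothetical per-abstract scores (e.g. probability of discussing an interaction).
score = {"a1": 0.9, "a2": 0.7, "a3": 0.1}
cociting = {
    ("P1", "P2"): ["a1", "a2"],
    ("P3", "P4"): ["a3"],
    ("P5", "P6"): ["a2", "a3"],
}
ranking = rerank(cociting, score)
```

Here the pair (P1, P2) moves to the top because both of its co-citing abstracts score highly, while (P3, P4) drops to the bottom.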
48Integrating Extracted Data with Existing
Databases
- Extracted: 6,580 interactions between 3,737 human proteins.
- Total: 31,609 interactions between 7,748 human proteins.
49Filtered Co-citation Analysis Evaluation
50ERK (Extraction using a Relation Kernel)
- Use SVM with a string kernel.
- The patterns (features) are sparse subsequences of words constrained to be anchored on the two protein names.
- The feature space can be further pruned down: in almost all examples, a sentence asserts a relationship between two entities using one of the following patterns:
  - FI (Fore-Inter): "interaction of P1 with P2", "activation of P1 by P2"
  - I (Inter): "P1 interacts with P2", "P1 is activated by P2"
  - IA (Inter-After): "P1 P2 complex", "P1 and P2 interact"
- Restrict the three types of patterns to use at most 4 words (besides the two protein anchors).
51ERK (Extraction using a Relation Kernel)
S1 = In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.
S2 = Experiments with human osteosarcoma cells and Ewing's sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and
- FI patterns: human cells P1 associated with P2,
- I patterns: P1 associated with P2,
- IA patterns: P1 associated with P2 ,,
- The kernel K(S1,S2) = the number of common patterns between S1 and S2, weighted by their span in the two sentences.
- K(S1,S2) can be computed based on the dynamic programming procedure from Lodhi et al., 2002.
- Train an SVM model to find a max-margin linear discriminator between positive and negative examples.
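The idea of counting common sparse word subsequences weighted by span can be illustrated with a brute-force version of the gap-weighted subsequence kernel. This sketch enumerates subsequences explicitly instead of using the dynamic program of Lodhi et al., and it omits the ERK-specific parts (anchoring on the two protein names and the FI / I / IA segmentation); the decay factor lam and the example sentences are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

def subsequence_weights(tokens: List[str], n: int, lam: float) -> Dict[Tuple[str, ...], float]:
    """Map each length-n word subsequence to the sum of lam**span over its occurrences."""
    weights: Dict[Tuple[str, ...], float] = defaultdict(float)
    for idx in combinations(range(len(tokens)), n):
        span = idx[-1] - idx[0] + 1          # tokens covered, including skipped ones
        weights[tuple(tokens[i] for i in idx)] += lam ** span
    return weights

def subsequence_kernel(s: List[str], t: List[str], n: int, lam: float = 0.5) -> float:
    """Gap-weighted count of length-n word subsequences common to s and t."""
    ws = subsequence_weights(s, n, lam)
    wt = subsequence_weights(t, n, lam)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)

s1 = "P1 is induced and becomes associated with P2".split()
s2 = "P1 is associated with both P2".split()
k12 = subsequence_kernel(s1, s2, n=4)
```

Enumeration is exponential in n, which is tolerable only because patterns are restricted to a few words; the dynamic program avoids that cost. Sparse patterns such as "P1 associated with P2" contribute to k12 with a weight that shrinks as the words spread further apart in either sentence.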
52Evaluation ERK vs ELCS vs Manual
53Future Work Conclusions
- Future Work
  - Analyze the complete set of 750K abstracts using the relational kernel and integrate results into an improved composite dataset.
- Conclusions
  - Created a large database of interacting human proteins by consolidating interactions automatically extracted from Medline abstracts with existing databases.
  - Final database: 31,609 interactions between 7,748 human proteins.
54For Further Information
- Consolidated database available online:
  - http://bioinformatics.icmb.utexas.edu/idserve/
- Papers available online:
  - http://www.cs.utexas.edu/users/ml/publication/bioinformatics.html
- Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome, Ramani, A.K., Bunescu, R.C., Mooney, R.J. and Marcotte, E.M., Genome Biology, 6, 5, r40 (2005).
- Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions, Arun Ramani, Edward Marcotte, Razvan Bunescu, Raymond Mooney, to appear in the Proceedings of ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.
- Collective Information Extraction with Relational Markov Networks, Razvan Bunescu and Raymond J. Mooney, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), pp. 439-446, Barcelona, Spain, July 2004.
- Comparative Experiments on Learning Information Extractors for Proteins and their Interactions, Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong, Artificial Intelligence in Medicine (Special Issue on Summarization and Information Extraction from Medical Documents), 33, 2 (2005), pp. 139-155.