Title: Extracting Functional Annotations of Proteins Based on Hybrid Text Mining Approaches
1Extracting Functional Annotations of Proteins
Based on Hybrid Text Mining Approaches
- Jung-Hsien Chiang and Hsu-Chun Yu
- SUN Center of Excellence on Bioinformatics
- Department of Computer Science and Information Engineering
- National Cheng Kung University
- Tainan 701, Taiwan, ROC
- 2004/3/29
2Outline
- Extraction Procedure
- Competition Results
- Examples
- Conclusion
3Extraction Procedure
[Flowchart: extraction procedure]
- Processing steps: Sentence Detection, Sentence Indexing, POS Tagging, GO Term Indexing, Protein Name Indexing, Co-occurrence Extraction, Phrase Parsing, Pattern Matching, Sentence Classification, Template Screening
- Data and resources: Documents, Sentence Index, Protein Index, GO Index, Co-occurrence, Finite Automata, Parsed Sentences, Phrasal Patterns, Templates
4Sentence Detection and Indexing
- The final submission format of the task required the original text of the document that supports an annotation.
- However, each sentence was tokenized after sentence detection, so the processed text no longer matched the original.
- Hence sentences were first indexed by recording their positions in the text, so that the system could return to the original text and provide the evidence text.
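The position-based indexing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `index_sentences` and the naive boundary rule (a sentence ends at `.`, `!` or `?` followed by whitespace) are assumptions.

```python
import re

def index_sentences(text):
    """Return (sentence_id, begin, end) character offsets into the original text.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace or
    end of text. Abbreviations such as "i.e." would be split incorrectly.
    """
    spans = []
    pattern = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", flags=re.S)
    for sentence_id, match in enumerate(pattern.finditer(text)):
        spans.append((sentence_id, match.start(), match.end()))
    return spans

def tokenize(sentence):
    """Hypothetical tokenizer: splits words from punctuation, so the token
    sequence no longer reproduces the original spacing."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "RasGRF1 is localized to the plasma membrane. Sos is not."
for sentence_id, begin, end in index_sentences(text):
    original = text[begin:end]      # evidence text recovered from offsets
    tokens = tokenize(original)     # modified form used for later processing
    print(sentence_id, begin, end, tokens)
```

Because only the offsets are stored, later stages can modify the tokens freely and the evidence text is still recovered exactly with `text[begin:end]`.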
5Indexing of GO Terms and Protein Names
- To identify GO terms and protein names in text, we use an indexing method instead of a tagging one.
- A word in text may match more than one GO term or protein name, or even both, so directly tagging the names in text does not work.
- Index record fields: sentence_id, begin, end, term_id
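A small sketch of why index records cope with overlapping matches where inline tags would not. The record fields follow the slide (sentence_id, begin, end, term_id); everything else is an assumption: `index_terms` is a hypothetical helper, begin and end are taken to be token positions, and the term ID `TERM:membrane` is made up for illustration.

```python
from collections import namedtuple

IndexRecord = namedtuple("IndexRecord", ["sentence_id", "begin", "end", "term_id"])

def index_terms(sentences, lexicon):
    """Record every lexicon match as a separate index row.

    sentences: list of token lists.
    lexicon:   dict mapping term_id -> tuple of lowercase tokens.
    Assumption: begin/end are token positions, end exclusive.
    """
    records = []
    for sentence_id, tokens in enumerate(sentences):
        lowered = [token.lower() for token in tokens]
        for term_id, term_tokens in lexicon.items():
            width = len(term_tokens)
            for begin in range(len(lowered) - width + 1):
                if tuple(lowered[begin:begin + width]) == term_tokens:
                    records.append(IndexRecord(sentence_id, begin, begin + width, term_id))
    return records

lexicon = {
    "GO:0005886": ("plasma", "membrane"),
    "TERM:membrane": ("membrane",),   # hypothetical ID for an overlapping term
}
sentences = [["RasGRF1", "is", "localized", "to", "the", "plasma", "membrane"]]
for record in index_terms(sentences, lexicon):
    print(record)
```

The token "membrane" appears in two records at once, which a single inline tag around the word could not express.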
6Co-occurrence Extraction
- Extracts sentences in which a protein name and a GO term co-occur.
- Implementation: since the indices of protein names and GO terms are stored in a database, the system executes an SQL statement to query the sentence IDs that appear in both the protein name index and the GO term index.
[Figure: a Sentence containing a Protein and a GO term, linked to the Protein Index and the GO Index through protein_index.sentence_id and go_index.sentence_id]
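The co-occurrence query can be illustrated with an in-memory SQLite database. The table names `protein_index` and `go_index` and the column `sentence_id` come from the slide; the remaining schema, the join condition, and the sample protein ID `P-OTHER` and term ID `GO:OTHER` are assumptions made for this sketch, and the original system's database and SQL statement are not shown in the source.

```python
import sqlite3

# "begin" and "end" are quoted because they are SQL keywords.
SCHEMA = """
CREATE TABLE protein_index (sentence_id INTEGER, "begin" INTEGER, "end" INTEGER, term_id TEXT);
CREATE TABLE go_index (sentence_id INTEGER, "begin" INTEGER, "end" INTEGER, term_id TEXT);
"""

# Sentences that appear in both indices, with the co-occurring pair.
QUERY = """
SELECT DISTINCT p.sentence_id, p.term_id, g.term_id
FROM protein_index AS p
JOIN go_index AS g ON p.sentence_id = g.sentence_id
ORDER BY p.sentence_id, p.term_id, g.term_id
"""

def cooccurrences(protein_rows, go_rows):
    """Return (sentence_id, protein_id, go_id) for every co-occurring pair."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.executemany("INSERT INTO protein_index VALUES (?, ?, ?, ?)", protein_rows)
        conn.executemany("INSERT INTO go_index VALUES (?, ?, ?, ?)", go_rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()

protein_rows = [(0, 5, 6, "Q13972"), (1, 0, 1, "P-OTHER")]
go_rows = [(0, 10, 12, "GO:0005886"), (2, 3, 4, "GO:OTHER")]
print(cooccurrences(protein_rows, go_rows))
```

Only sentence 0 is returned here, since it is the only sentence ID present in both tables.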
7Phrase Parsing
- We applied shallow parsing only to those sentences in which a protein name and a GO term co-occur.
- The system recognizes noun and verb phrases using finite automata, which model general forms of phrase constructs.
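As a rough illustration of phrase recognition with a finite automaton, the sketch below chunks noun phrases from POS-tagged tokens. The automaton itself is an assumption (optional determiner, any number of adjectives, one or more nouns); the transition table used by the authors is not recoverable from the text.

```python
# Assumed noun-phrase automaton: DT? JJ* NN+ (tags coarsened to DT/JJ/NN).
NP_TRANSITIONS = {
    ("start", "DT"): "det",
    ("start", "JJ"): "adj",
    ("start", "NN"): "noun",
    ("det", "JJ"): "adj",
    ("det", "NN"): "noun",
    ("adj", "JJ"): "adj",
    ("adj", "NN"): "noun",
    ("noun", "NN"): "noun",
}
ACCEPTING = {"noun"}

def coarse(tag):
    """Map Penn-style tags such as NNS, NNP, JJR onto NN or JJ."""
    if tag.startswith("NN"):
        return "NN"
    if tag.startswith("JJ"):
        return "JJ"
    return tag

def chunk_noun_phrases(tagged):
    """Return (begin, end) token spans of the longest noun phrases, end exclusive."""
    chunks = []
    i = 0
    while i < len(tagged):
        state, j, last_accept = "start", i, None
        while j < len(tagged):
            next_state = NP_TRANSITIONS.get((state, coarse(tagged[j][1])))
            if next_state is None:
                break
            state = next_state
            j += 1
            if state in ACCEPTING:
                last_accept = j   # remember the longest accepted prefix
        if last_accept is not None:
            chunks.append((i, last_accept))
            i = last_accept
        else:
            i += 1
    return chunks

tagged = [("These", "DT"), ("results", "NNS"), ("indicate", "VBP"),
          ("that", "IN"), ("Pyk2", "NNP"), ("is", "VBZ"), ("involved", "VBN")]
print(chunk_noun_phrases(tagged))
```

A verb-phrase automaton would follow the same shape with a different transition table.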
8Phrase Parsing (Cont.)
- The Finite Automaton for Recognizing Noun Phrases
9Phrase Parsing (Cont.)
- <phrase type> <token> <POS> <slot>
-
- NP These DT . results NNS .
- VP indicate VBP .
- ._ that IN .
- NP Pyk 2 NNP P
- VP is VBZ . involved VBN .
- ._ in IN .
- NP the DT . signal transduction NNP G
- VP pathway RB . leading VBG .
- ._ to TO .
- NP IL NNP . - .
- ._ 2 LS .
- NP production NN .
10Pattern Matching
- Biological process
-
- NP . . P
- VP plays . .
- NP role . .
- ._ . IN .
- NP . . G
-
- Cellular component
-
- NP . . P
- VP localized|colocalizes|immunolocalized . .
- ._ . IN|TO .
- NP . . G
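The sketch below shows one way such a phrasal pattern could be matched against a parsed sentence. It rests on several assumptions: "." is treated as a wildcard, the run-together verb list and POS list in the cellular component pattern are read as alternatives (localized, colocalizes, immunolocalized; IN or TO), a pattern element matches a phrase if any token in the phrase satisfies it, and pattern elements must match consecutive phrases. The data layout and function names are hypothetical.

```python
# Pattern element: (phrase_type, allowed_tokens, allowed_pos, slot); None = wildcard.
CELLULAR_COMPONENT = [
    ("NP", None, None, "P"),
    ("VP", {"localized", "colocalizes", "immunolocalized"}, None, None),
    ("._", None, {"IN", "TO"}, None),
    ("NP", None, None, "G"),
]

def element_matches(element, phrase):
    """A phrase is (phrase_type, [(token, pos, slot), ...])."""
    want_type, want_tokens, want_pos, want_slot = element
    phrase_type, tokens = phrase
    if want_type != phrase_type:
        return False
    return any(
        (want_tokens is None or token.lower() in want_tokens)
        and (want_pos is None or pos in want_pos)
        and (want_slot is None or slot == want_slot)
        for token, pos, slot in tokens
    )

def find_pattern(pattern, sentence):
    """Return the index of the first phrase where the pattern matches, else None."""
    width = len(pattern)
    for start in range(len(sentence) - width + 1):
        window = sentence[start:start + width]
        if all(element_matches(e, p) for e, p in zip(pattern, window)):
            return start
    return None

parsed = [
    ("._", [("In", "IN", None)]),
    ("NP", [("contrast", "NN", None)]),
    ("._", [("to", "TO", None)]),
    ("NP", [("Sos", "NNP", None)]),
    ("._", [(",", ",", None)]),
    ("NP", [("RasGRF 1", "NNP", "P")]),
    ("VP", [("is", "VBZ", None), ("constitutively", "RB", None), ("localized", "VBN", None)]),
    ("._", [("to", "TO", None)]),
    ("NP", [("the", "DT", None), ("plasma membrane", "NNP", "G")]),
]
print(find_pattern(CELLULAR_COMPONENT, parsed))
```

The slot values P and G mark the phrases that carry the indexed protein name and GO term, so a match directly yields the pair to annotate.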
11Sentence Classification
- A Naïve Bayes classifier predicts the overall likelihood that a sentence describes a protein-function relation.
[Figure: a Sentence split into Prefix, Infix, and Suffix regions around the Protein/GO (or GO/Protein) pair]
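A minimal Naïve Bayes sketch for this kind of sentence classification. The feature design is an assumption: words are tagged by whether they fall before (prefix), between (infix), or after (suffix) the two mentions, which is one plausible reading of the slide labels. The class, the add-one smoothing, and the two training sentences are toy choices for illustration, not the authors' model or data.

```python
import math
from collections import Counter

def region_features(tokens, first, second):
    """Words tagged by region relative to the two mention positions (first < second)."""
    features = ["prefix=" + t.lower() for t in tokens[:first]]
    features += ["infix=" + t.lower() for t in tokens[first + 1:second]]
    features += ["suffix=" + t.lower() for t in tokens[second + 1:]]
    return features

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = {}
        self.vocabulary = set()

    def train(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            self.feature_counts.setdefault(label, Counter()).update(features)
            self.vocabulary.update(features)

    def log_score(self, features, label):
        counts = self.feature_counts[label]
        denominator = sum(counts.values()) + len(self.vocabulary)
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for feature in features:
            score += math.log((counts[feature] + 1) / denominator)
        return score

    def probability(self, features, label):
        scores = {c: self.log_score(features, c) for c in self.class_counts}
        top = max(scores.values())
        total = sum(math.exp(s - top) for s in scores.values())
        return math.exp(scores[label] - top) / total

# Toy training data; PROT and GO stand in for the two mentions.
model = NaiveBayes()
model.train([
    (region_features("PROT is localized to GO".split(), 0, 4), "relation"),
    (region_features("PROT and GO were not studied".split(), 0, 2), "no_relation"),
])
query = region_features("PROT is localized to GO".split(), 0, 4)
print(model.probability(query, "relation"))
```

The probability for the "relation" class can then serve as the score by which candidate templates are ranked.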
12GO Variant Mining
- GO Variant
- Consisting of the same tokens but in a different order, e.g. "mitochondrial inner membrane" and "inner mitochondrial membrane".
- Consisting of the major tokens but with a few insertions or deletions of tokens, e.g. "phenylalanine catabolism" and "catabolism of phenylalanine".
13GO Variant Mining (Cont.)
- Edit distance: the minimum number of insertions or deletions of tokens necessary to make two terms equal, regardless of the token order.
- Candidate GO Variants
- insertion < 2 and deletion < 2
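The order-insensitive edit distance can be sketched by comparing the two terms as bags of tokens. The function names are hypothetical, tokenization is assumed to be lowercase whitespace splitting, and the threshold is read as fewer than 2 insertions and fewer than 2 deletions; it is kept as a parameter because the exact bound is hard to confirm from the slide text.

```python
from collections import Counter

def token_edits(term, candidate):
    """Return (insertions, deletions) needed to turn term into candidate,
    ignoring token order."""
    term_counts = Counter(term.lower().split())
    candidate_counts = Counter(candidate.lower().split())
    insertions = sum((candidate_counts - term_counts).values())
    deletions = sum((term_counts - candidate_counts).values())
    return insertions, deletions

def is_candidate_variant(term, candidate, max_edits=2):
    """Assumed reading of the slide: insertion < 2 and deletion < 2."""
    insertions, deletions = token_edits(term, candidate)
    return insertions < max_edits and deletions < max_edits

print(token_edits("mitochondrial inner membrane", "inner mitochondrial membrane"))
print(token_edits("phenylalanine catabolism", "catabolism of phenylalanine"))
print(is_candidate_variant("phenylalanine catabolism", "catabolism of phenylalanine"))
```

Reordered terms have zero edits under this measure, and "catabolism of phenylalanine" differs from "phenylalanine catabolism" by a single inserted token.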
14Template Screening
[Flowchart: template screening with the phrasal pattern approach]
- Step 1: GO term indexing. Extractable by the phrasal pattern approach? Yes: output one of the templates. No: go to step 2.
- Step 2: GO variant indexing. Extractable by the phrasal pattern approach? Yes: output one of the templates. No: end.
15Template Screening (Cont.)
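[Flowchart: template screening with both approaches]
- Step 1: GO term indexing. Extractable by the phrasal pattern approach? If not, extractable by the sentence classifier approach? Yes: output the top one of the templates ranked by the sentence classifier. No to both: go to step 2.
- Step 2: GO variant indexing. Extractable by the phrasal pattern approach? If not, extractable by the sentence classifier approach? Yes: output the top one of the templates ranked by the sentence classifier. No to both: end.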
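The screening cascade can be summarized as a fallback loop: try exact GO term indexing first, then GO variant indexing, and within each try the phrasal pattern approach before the sentence classifier. The sketch below is an interpretation of the flowchart under that assumption; `screen_templates` and the stub stages are hypothetical, and each extractor is assumed to return its templates best first.

```python
def screen_templates(sentence, index_stages, extractors):
    """Return the first template found by the cascade, or None.

    index_stages: callables mapping a sentence to an indexed sentence,
                  tried in order (e.g. GO term indexing, GO variant indexing).
    extractors:   callables mapping an indexed sentence to a list of
                  templates, best first (e.g. phrasal pattern, classifier).
    """
    for index in index_stages:
        indexed = index(sentence)
        for extract in extractors:
            templates = extract(indexed)
            if templates:
                return templates[0]
    return None

# Stub stages, for illustration only.
def go_term_indexing(sentence):
    return ("exact", sentence)

def go_variant_indexing(sentence):
    return ("variant", sentence)

def phrasal_pattern(indexed):
    return ["pattern-template"] if indexed == ("variant", "s1") else []

def sentence_classifier(indexed):
    return ["top-ranked", "second"] if indexed == ("exact", "s2") else []

stages = [go_term_indexing, go_variant_indexing]
approaches = [phrasal_pattern, sentence_classifier]
for sentence in ("s1", "s2", "s3"):
    print(sentence, screen_templates(sentence, stages, approaches))
```

Passing only the phrasal pattern extractor reproduces the simpler cascade, while passing both gives the combined one.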
16Competition Results
17Example Submission
- <protein>
- <namefile>JBC_2000-2/bc032260.gml</namefile>
- <idTask>2.1</idTask>
- <participant>Jung-Hsien Chiang, Hsu-Chun Yu</participant>
- <nameProtein>RASGRF1</nameProtein>
- <dbId>Q13972</dbId>
- <sourceDb>Swiss-Prot</sourceDb>
- <goCode>
- <name>plasma membrane</name>
- <code>0005886</code>
- <evidenceText>In contrast to Sos, RasGRF1 is constitutively localized to the plasma membrane, probably anchored through its N-terminal pleckstrin homology domain (<BBR RID="B23">).</evidenceText>
- </goCode>
- </protein>
18Example (Cont.) Parsed Sentence
-
- . In IN .
- NP contrast NN .
- . to TO .
- NP Sos NNP .
- . , , .
- NP RasGRF 1 NNP P
- VP is VBZ . constitutively RB . localized VBN .
- . to TO .
- NP the DT . plasma membrane NNP G
- . , , .
- VP probably RB . anchored VBD .
- . through IN .
- . its PRP .
- NP N NNP . - .
- NP terminal JJ . pleckstrin NN . homology NN .
- domain NN .
19Example (Cont.) Matched Pattern
-
- NP . . P
- VP localized|colocalizes|immunolocalized . .
- ._ . IN|TO .
- NP . . G
-
- Document
20Conclusion
- We propose a mixture of phrasal pattern and sentence classifier approaches to perform the automatic assignment of GO annotations to proteins.
- We also use the GO variant mining method to search for potential GO variants. This method can broaden the coverage of GO term indexing.