Extracting Functional Annotations of Proteins Based on Hybrid Text Mining Approaches

1
Extracting Functional Annotations of Proteins
Based on Hybrid Text Mining Approaches
  • Jung-Hsien Chiang and Hsu-Chun Yu
  • SUN Center of Excellence on Bioinformatics
  • Department of Computer Science and Information
    Engineering
  • National Cheng Kung University
  • Tainan 701, Taiwan, ROC
  • 2004/3/29

2
Outline
  • Extraction Procedure
  • Competition Results
  • Examples
  • Conclusion

3
Extraction Procedure
  • [Flowchart] Processing steps shown: Sentence
    Detection, Sentence Indexing, POS Tagging, GO
    Term Indexing, Protein Name Indexing,
    Co-occurrence Extraction, Phrase Parsing, Pattern
    Matching, Sentence Classification, Template
    Screening.
  • [Flowchart] Inputs and intermediate data shown:
    Documents, Sentence Index, Protein Index, GO
    Index, Co-occurrence, Finite Automata, Parsed
    Sentences, Phrasal Patterns, Templates.

4
Sentence Detection and Indexing
  • The final submission format of the task requires
    the original text of the document that supports
    an annotation.
  • However, each sentence is tokenized after
    sentence detection, which modifies the original
    text.
  • Sentences are therefore indexed first by recording
    their positions in the text, so that the system
    can return to the original text and provide the
    evidence text.
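
The indexing step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular-expression sentence splitter and the function names `index_sentences` and `evidence_text` are assumptions introduced here. The point is only that character offsets are stored before tokenization, so the untouched evidence text can be recovered later.

```python
import re

def index_sentences(text):
    """Return (sentence_id, begin, end) character spans, one per sentence.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace
    or the end of the text. A real system would use a proper detector.
    """
    spans = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        if text[start:end].strip():
            spans.append((len(spans), start, end))
        # Move past the whitespace that separates sentences.
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    # Keep any trailing text that lacks a final punctuation mark.
    if start < len(text) and text[start:].strip():
        spans.append((len(spans), start, len(text)))
    return spans

def evidence_text(text, span):
    """Recover the original, unmodified sentence from its stored span."""
    _, begin, end = span
    return text[begin:end]

doc = "Pyk2 is involved in IL-2 production.  It binds Grb2."
sentence_index = index_sentences(doc)

# Tokenization may now rewrite the sentence freely (here: splitting hyphens),
# because the stored offsets still point at the original text.
first = evidence_text(doc, sentence_index[0])
tokens = first.replace("-", " - ").split()
```

Because only offsets are stored, later stages can change the tokenized copy in any way without losing the text needed for the submission.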

5
Indexing of GO Terms and Protein Names
  • For the identification of GO terms and protein
    names in text, we use an indexing method instead
    of a tagging one.
  • Since a word in text may match more than one GO
    term or protein name, even both, a method of
    directly tagging the names in text does not work.

[Table] Index columns: sentence_id | begin | end | term_id

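
A small sketch of why an index handles overlapping matches where inline tagging cannot. The record layout follows the four column names on the slide; everything else (the `index_terms` function, token-based positions with an exclusive end, and the two-entry lexicon) is an assumption made for illustration.

```python
from collections import namedtuple

# One row per match, mirroring the slide's columns.
IndexEntry = namedtuple("IndexEntry", "sentence_id begin end term_id")

def index_terms(sentence_id, tokens, lexicon):
    """Record every lexicon match as a token span (end is exclusive).

    Overlapping and nested matches are all kept, which inline tagging
    of the text could not represent.
    """
    lowered = [t.lower() for t in tokens]
    entries = []
    for term_id, term_tokens in lexicon.items():
        n = len(term_tokens)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == term_tokens:
                entries.append(IndexEntry(sentence_id, i, i + n, term_id))
    return entries

# Illustrative lexicon: "membrane" is nested inside "plasma membrane".
go_lexicon = {
    "GO:0005886": ["plasma", "membrane"],
    "GO:0016020": ["membrane"],
}
tokens = "RasGRF1 is localized to the plasma membrane".split()
go_index = index_terms(0, tokens, go_lexicon)
```

Here the token "membrane" belongs to two index entries at once, which is exactly the case the slide says direct tagging cannot handle.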
6
Co-occurrence Extraction
  • Extracts sentences with the co-occurrence of a
    protein name and a GO term.
  • In regard to implementation, since indices of
    protein names and GO terms are stored in a
    database, the system executes a SQL statement to
    query the sentence IDs belonging to both a
    protein name index and a GO term index.

[Figure] A sentence containing a protein name and a GO term is located
through the Protein Index and the GO Index, joined on
protein_index.sentence_id = go_index.sentence_id

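
The query described above might look like the following sketch, using Python's built-in `sqlite3` module. The table names come from the slide; the column types, the sample rows, and the choice of SQLite are assumptions, since the slides do not say which database was used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "begin" and "end" are SQL keywords, so they are quoted as identifiers.
conn.executescript("""
CREATE TABLE protein_index (sentence_id INTEGER, "begin" INTEGER, "end" INTEGER, term_id TEXT);
CREATE TABLE go_index (sentence_id INTEGER, "begin" INTEGER, "end" INTEGER, term_id TEXT);
""")

# Illustrative rows only: sentence 0 has both a protein and a GO term.
conn.executemany(
    "INSERT INTO protein_index VALUES (?, ?, ?, ?)",
    [(0, 0, 1, "Q13972"), (2, 3, 4, "Q13972")],
)
conn.executemany(
    "INSERT INTO go_index VALUES (?, ?, ?, ?)",
    [(0, 5, 7, "GO:0005886"), (1, 2, 3, "GO:0016020")],
)

# Sentences that appear in both indices are the co-occurrences.
query = """
SELECT DISTINCT p.sentence_id, p.term_id AS protein_id, g.term_id AS go_id
FROM protein_index AS p
JOIN go_index AS g ON p.sentence_id = g.sentence_id
ORDER BY p.sentence_id
"""
cooccurrences = conn.execute(query).fetchall()
```

Only sentence 0 survives the join, because sentences 1 and 2 each appear in just one of the two indices.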
7
Phrase Parsing
  • We used a shallow parsing method only on those
    sentences with the co-occurrence of a protein
    name and a GO term.
  • The system recognizes noun and verb phrases using
    finite automata, which model general forms of
    phrase constructs.
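
To make the idea concrete, here is a sketch of a finite automaton over POS tags that recognizes simple noun phrases. The slides do not reproduce the actual automaton, so the phrase form used here (optional determiner, any number of adjectives, one or more nouns) and all names are assumptions chosen for illustration.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def symbol(pos):
    """Map a Penn Treebank tag to an input symbol of the automaton."""
    if pos in ("DT", "JJ"):
        return pos
    if pos in NOUN_TAGS:
        return "N"
    return None  # any other tag has no transition

# States: 0 = start, 1 = after determiner, 2 = after adjective(s),
# 3 = after noun(s); only state 3 is accepting.
TRANSITIONS = {
    (0, "DT"): 1, (0, "JJ"): 2, (0, "N"): 3,
    (1, "JJ"): 2, (1, "N"): 3,
    (2, "JJ"): 2, (2, "N"): 3,
    (3, "N"): 3,
}
ACCEPTING = {3}

def longest_np(tagged, start):
    """Return the exclusive end of the longest NP beginning at start, or None."""
    state, last_accept = 0, None
    for i in range(start, len(tagged)):
        state = TRANSITIONS.get((state, symbol(tagged[i][1])))
        if state is None:
            break
        if state in ACCEPTING:
            last_accept = i + 1
    return last_accept

def chunk_noun_phrases(tagged):
    """Scan left to right, emitting the longest NP found at each position."""
    phrases, i = [], 0
    while i < len(tagged):
        end = longest_np(tagged, i)
        if end is None:
            i += 1
        else:
            phrases.append([tok for tok, _ in tagged[i:end]])
            i = end
    return phrases

tagged = [("These", "DT"), ("results", "NNS"), ("indicate", "VBP"),
          ("that", "IN"), ("Pyk2", "NNP"), ("is", "VBZ"), ("involved", "VBN")]
noun_phrases = chunk_noun_phrases(tagged)
```

A verb-phrase recognizer would follow the same shape with a different transition table.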

8
Phrase Parsing (Cont.)
  • The Finite Automaton for Recognizing Noun Phrases

9
Phrase Parsing (Cont.)
  • <phrase type> <token> <POS> <slot>
  • NP These DT . results NNS .
  • VP indicate VBP .
  • ._ that IN .
  • NP Pyk 2 NNP P
  • VP is VBZ . involved VBN .
  • ._ in IN .
  • NP the DT . signal transduction NNP G
  • VP pathway RB . leading VBG .
  • ._ to TO .
  • NP IL NNP . - .
  • ._ 2 LS .
  • NP production NN .

10
Pattern Matching
  • Biological process
  • NP . . P
  • VP plays . .
  • NP role . .
  • ._ . IN .
  • NP . . G
  • Cellular component
  • NP . . P
  • VP localized|colocalizes|immunolocalized . .
  • ._ . IN|TO .
  • NP . . G
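
The patterns above can be read as sequences of constraints on consecutive phrases, where "." leaves a field unconstrained. The sketch below encodes the cellular component pattern that way and matches it against the parsed RasGRF1 sentence from the example slides. The data layout, the function names, and the rule that a constraint is met if any token in the phrase satisfies it are assumptions; the pattern row with no phrase label is treated here as accepting any phrase type.

```python
# Each element: (phrase_type, allowed_tokens, allowed_pos, slot); None = wildcard.
CELLULAR_COMPONENT = [
    ("NP", None, None, "P"),
    ("VP", {"localized", "colocalizes", "immunolocalized"}, None, None),
    (None, None, {"IN", "TO"}, None),
    ("NP", None, None, "G"),
]

def element_matches(element, phrase):
    """Check one pattern element against one parsed phrase."""
    want_type, want_tokens, want_pos, want_slot = element
    phrase_type, tokens = phrase
    if want_type is not None and phrase_type != want_type:
        return False
    # Each remaining constraint must hold for at least one token of the phrase.
    if want_tokens is not None and not any(t in want_tokens for t, _, _ in tokens):
        return False
    if want_pos is not None and not any(p in want_pos for _, p, _ in tokens):
        return False
    if want_slot is not None and not any(s == want_slot for _, _, s in tokens):
        return False
    return True

def find_match(pattern, phrases):
    """Return the index of the first run of phrases matching pattern, or None."""
    width = len(pattern)
    for start in range(len(phrases) - width + 1):
        window = phrases[start:start + width]
        if all(element_matches(e, p) for e, p in zip(pattern, window)):
            return start
    return None

# Parsed phrases as (phrase_type, [(token, POS, slot), ...]).
phrases = [
    ("NP", [("RasGRF 1", "NNP", "P")]),
    ("VP", [("is", "VBZ", None), ("constitutively", "RB", None),
            ("localized", "VBN", None)]),
    (None, [("to", "TO", None)]),
    ("NP", [("the", "DT", None), ("plasma membrane", "NNP", "G")]),
]
match_start = find_match(CELLULAR_COMPONENT, phrases)
```

A match yields the phrases carrying the P and G slots, which together identify the protein and the GO term of the extracted annotation.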

11
Sentence Classification
  • A Naïve Bayes classifier predicts the overall
    likelihood that a sentence describes a
    protein-function relation.

[Figure] A sentence is divided by its protein name and GO term (in
either order, Protein/GO or GO/Protein) into Prefix, Infix, and Suffix
regions.

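
A compact sketch of such a classifier, assuming the features are the words of the prefix, infix, and suffix regions around the protein name and GO term. The slides do not list the features or training data, so the feature scheme, the Laplace smoothing, the class labels, and the four toy training sentences below are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

def region_features(tokens, first, second):
    """Tag each context word with the region it falls in."""
    lo, hi = sorted((first, second))
    feats = [f"prefix:{t.lower()}" for t in tokens[:lo]]
    feats += [f"infix:{t.lower()}" for t in tokens[lo + 1:hi]]
    feats += [f"suffix:{t.lower()}" for t in tokens[hi + 1:]]
    return feats

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in samples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_scores(self, feats):
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)  # class prior
            for f in feats:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, feats):
        scores = self.log_scores(feats)
        return max(scores, key=scores.get)

# Toy training data (made up); PROT and GO mark the two entity positions.
train = [
    (region_features("PROT is localized to the GO".split(), 0, 5), "relation"),
    (region_features("PROT plays a role in GO".split(), 0, 5), "relation"),
    (region_features("PROT was not detected in GO".split(), 0, 5), "no_relation"),
    (region_features("Unlike PROT , GO was unaffected".split(), 1, 3), "no_relation"),
]
model = NaiveBayes().fit(train)
test_feats = region_features(
    "PROT is constitutively localized to the GO".split(), 0, 6)
prediction = model.predict(test_feats)
```

The log scores returned by `log_scores` could also be used to rank candidate sentences rather than only to classify them.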
12
GO Variant Mining
  • GO Variant
  • Consisting of the same tokens but in a different
    order, e.g. "mitochondrial inner membrane" and
    "inner mitochondrial membrane".
  • Consisting of the major tokens but with a few
    insertions or deletions of tokens, e.g.
    "phenylalanine catabolism" and "catabolism of
    phenylalanine".

13
GO Variant Mining (Cont.)
  • Edit distance: the minimum number of insertions
    or deletions of tokens necessary to make two
    terms equal, regardless of the token order.
  • Candidate GO Variants
  • insertion < 2 and deletion < 2
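
Because token order is ignored, this distance reduces to a comparison of token multisets. The sketch below assumes whitespace tokenization, case-insensitive comparison, and a threshold read as strictly fewer than two insertions and strictly fewer than two deletions; the function names are hypothetical.

```python
from collections import Counter

def token_edits(go_term, candidate):
    """Count token insertions and deletions that turn go_term into candidate,
    ignoring token order."""
    term = Counter(go_term.lower().split())
    cand = Counter(candidate.lower().split())
    insertions = sum((cand - term).values())  # tokens only in the candidate
    deletions = sum((term - cand).values())   # tokens only in the GO term
    return insertions, deletions

def is_candidate_variant(go_term, candidate, limit=2):
    """Apply the candidate rule: both counts must stay below the limit."""
    insertions, deletions = token_edits(go_term, candidate)
    return insertions < limit and deletions < limit

reordered = token_edits("mitochondrial inner membrane",
                        "inner mitochondrial membrane")
with_insertion = token_edits("phenylalanine catabolism",
                             "catabolism of phenylalanine")
```

For the two example pairs from the previous slide, the first needs no edits at all and the second needs a single insertion ("of"), so both qualify as candidate variants under this reading.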

14
Template Screening
  • Strategy 1: GO term indexing, then test whether
    the sentence is extractable by the phrasal
    pattern approach. Yes: output one of the
    templates. No: end.
  • Strategy 2: GO term indexing, then test whether
    the sentence is extractable by the phrasal
    pattern approach. Yes: output one of the
    templates. No: apply GO variant indexing and
    test again with the phrasal pattern approach.
    Yes: output one of the templates. No: end.

15
Template Screening (Cont.)
  • Strategy 3: GO term indexing, then test whether
    the sentence is extractable by the phrasal
    pattern approach and, if not, by the sentence
    classifier approach. If both fail, apply GO
    variant indexing and repeat the two tests. On
    the first Yes, output the top one of the
    templates ranked by the sentence classifier. If
    every test fails, end.

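
Reading the Strategy 3 flowchart as a cascade, the control flow can be sketched as below. This is an interpretation of the diagram, not the authors' code: the function `strategy3`, its parameters, and the stub indexers and extractors are all hypothetical stand-ins.

```python
def strategy3(sentence, indexers, extractors, score):
    """Try each indexing method in turn; within each, try each extraction
    approach in turn. Return the best-scoring template from the first
    approach that yields any, or None if every test fails."""
    for index in indexers:          # GO term indexing, then GO variant indexing
        indexed = index(sentence)
        for extract in extractors:  # phrasal patterns, then sentence classifier
            templates = extract(indexed)
            if templates:
                return max(templates, key=score)
    return None

# Hypothetical stubs so the cascade can be exercised end to end.
def index_go_terms(sentence):
    return {"sentence": sentence, "go_hits": []}

def index_go_variants(sentence):
    return {"sentence": sentence, "go_hits": ["GO:0005886"]}

def by_pattern(indexed):
    return []  # pretend no phrasal pattern matches

def by_classifier(indexed):
    return [(indexed["sentence"], go) for go in indexed["go_hits"]]

def score(template):
    return 1.0  # a real system would use the classifier's likelihood

result = strategy3(
    "RasGRF1 is localized to the plasma membrane",
    [index_go_terms, index_go_variants],
    [by_pattern, by_classifier],
    score,
)
```

Strategies 1 and 2 fall out of the same function by passing only the phrasal pattern extractor, with one or both indexers respectively.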
16
Competition Results
17
Example: Submission
  • <protein>
  • <namefile>JBC_2000-2/bc032260.gml</namefile>
  • <idTask>2.1</idTask>
  • <participant>Jung-Hsien Chiang, Hsu-Chun
    Yu</participant>
  • <nameProtein>RASGRF1</nameProtein>
  • <dbId>Q13972</dbId>
  • <sourceDb>Swiss-Prot</sourceDb>
  • <goCode>
  • <name>plasma membrane</name>
  • <code>0005886</code>
  • <evidenceText>In contrast to Sos, RasGRF1 is
    constitutively localized to the plasma membrane,
    probably anchored through its N-terminal
    pleckstrin homology domain (<BBR
    RID="B23">).</evidenceText>
  • </goCode>
  • </protein>

18
Example (Cont.): Parsed Sentence
  • . In IN .
  • NP contrast NN .
  • . to TO .
  • NP Sos NNP .
  • . , , .
  • NP RasGRF 1 NNP P
  • VP is VBZ . constitutively RB .
    localized VBN .
  • . to TO .
  • NP the DT . plasma membrane NNP G
  • . , , .
  • VP probably RB . anchored VBD .
  • . through IN .
  • . its PRP .
  • NP N NNP . - .
  • NP terminal JJ . pleckstrin NN .
    homology NN . domain NN .

19
Example (Cont.): Matched Pattern
  • NP . . P
  • VP localized|colocalizes|immunolocalized . .
  • ._ . IN|TO .
  • NP . . G
  • Document

20
Conclusion
  • We propose a mixture of phrasal pattern and
    sentence classifier approaches to perform the
    automatic assignment of GO annotations to
    proteins.
  • We also use the GO variant mining method to
    search for potential GO variants. This method can
    broaden the coverage of GO term indexing.