IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text

1
IntEx: A Syntactic Role Driven Protein-Protein
Interaction Extractor for Bio-Medical Text
  • Syed Toufeeq Ahmed
  • Deepthi Chidambaram
  • Hasan Davulcu
  • Chitta Baral

2
Outline
  • Introduction
  • Issues and Challenges
  • Our Approach (IntEx System)
  • Evaluation
  • Future Work
  • Conclusion
  • Demo

3
Introduction
  • Genomic research in the last decade has produced
    a huge amount of data, and most of these
    findings are in the form of free text.
  • PubMed/MEDLINE has around 12 million abstracts
    online.
  • An automated tool to extract information from
    free text (bio-medical) will be of great use to
    researchers (biologists).

4
Issues that make extraction difficult (Seymore,
McCallum et al. 1999)
  • The task involves free text, hence there are
    many ways of stating the same fact.
  • The genre of text is not grammatically simple.
  • The text includes a lot of technical terminology
    unfamiliar to existing natural language
    processing systems.
  • Information may need to be combined across
    several sentences.
  • There are many sentences from which nothing
    should be extracted.

5
Challenges
  • Interactions are specified in different ways:
  • HMBA inhibits MEC-1 cell proliferation.
  • GBMs commonly overexpress the oncogenes EGFR and
    PDGFR, and contain mutations and deletions of
    tumor suppressor genes PTEN and TP53.
  • Protein kinase B (PKB) has emerged as the focal
    point for many signal transduction pathways,
    regulating multiple cellular processes such as
    glucose metabolism, transcription, apoptosis,
    cell proliferation, angiogenesis, and cell
    motility.

6
Challenges (cont.)
  • Anaphora resolution
  • Pronominals: It activates HMBA.
  • Sortal anaphora: Both enzymes are
    phosphorylated.
  • Event anaphora: This reaction acts in a
    mediated environment.
  • Multiple interactions in complex sentences

Most of the tumor-suppressive properties of Pten
are dependent on its lipid phosphatase activity,
which inhibits the phosphatidylinositol-3'-kinase
(PI3K)/Akt signaling pathway through
dephosphorylation of
phosphatidylinositol-(3,4,5)-triphosphate.
7
Our Approach (IntEx System)
  • Identify syntactic roles, such as Subject, Object,
    Verb and modifiers of a sentence.
  • Using these syntactic roles, transform complex
    sentences into multiple simple clauses.
  • Extract Protein-Protein interactions from these
    simple clausal structures.
  • Simple Pronoun resolution to identify references
    across multiple sentences.

8
IntEx System Architecture
9
IntEx System Components
  • Pronoun Resolution
  • Tagging: tagging biological entities with the
    help of biomedical and linguistic gazetteers.
  • Complex Sentence Processing: splitting complex
    sentences into simple clausal structures made
    up of syntactic roles.
  • Interaction Extractor: extracting complete
    interactions by analyzing the matching contents
    of syntactic roles and their linguistically
    significant combinations.

10
Pronoun Resolution
  • Pronouns in abstracts: third person
  • It, itself, them, themselves.
  • Replace each pronoun with the first noun group
    that matches in person/number agreement.
  • Before: Ku loads onto dsDNA ends and it can diffuse
    along the DNA in an energy-independent manner.

After: Ku loads onto dsDNA ends and Ku can diffuse along
the DNA in an energy-independent manner.
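
The replacement rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the IntEx implementation: the noun groups and their number (singular/plural) are assumed to come from an upstream chunker that is not shown, and the names `PRONOUN_NUMBER` and `resolve_pronouns` are made up for this sketch.

```python
import re

# Third-person pronouns named on the slide, mapped to grammatical number.
PRONOUN_NUMBER = {
    "it": "singular",
    "itself": "singular",
    "them": "plural",
    "themselves": "plural",
}


def resolve_pronouns(sentence, noun_groups):
    """Replace each listed pronoun with the first noun group of matching number.

    noun_groups is a list of (text, number) pairs in order of appearance.
    It is an assumed input: a real system would get it from a chunker.
    """
    def substitute(match):
        pronoun = match.group(0)
        wanted = PRONOUN_NUMBER[pronoun.lower()]
        for text, number in noun_groups:
            if number == wanted:
                return text
        return pronoun  # no agreeing noun group: leave the pronoun unchanged

    pattern = r"\b(" + "|".join(PRONOUN_NUMBER) + r")\b"
    return re.sub(pattern, substitute, sentence, flags=re.IGNORECASE)


sentence = ("Ku loads onto dsDNA ends and it can diffuse along "
            "the DNA in an energy-independent manner.")
noun_groups = [("Ku", "singular"), ("dsDNA ends", "plural"), ("the DNA", "singular")]
resolved = resolve_pronouns(sentence, noun_groups)
print(resolved)
```

Taking the first agreeing noun group keeps the rule as simple as the slide states it; it will pick the wrong antecedent whenever an earlier noun group happens to agree in number.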
11
Tagging
  • Dictionary lookup using gene/protein gazetteers
    from UMLS, LocusLink, etc.
  • To tag new gene names, we used regular
    expressions (alphanumeric names, combinations of
    lower-case and upper-case characters, etc.).
  • Some heuristics, such as using proper nouns and NP
    chunking, to improve recall.
  • The interaction word list is derived from UMLS and
    WordNet.
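
A rough sketch of how dictionary lookup and regular expressions could be combined. The two patterns below are hypothetical: the slide names the cues (alphanumeric names, mixed lower/upper case) but not the actual expressions, and the small `gazetteer` set merely stands in for UMLS/LocusLink lookups.

```python
import re

# Hypothetical patterns; the slide names the two cues but not the expressions.
ALPHANUMERIC = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]+$")  # letters and digits
MIXED_CASE = re.compile(r"^[a-z]+[A-Z][A-Za-z]*$")  # lower case followed by upper case


def tag_candidates(sentence, gazetteer):
    """Return (token, source) pairs for tokens that look like gene/protein names."""
    tagged = []
    for token in re.findall(r"[A-Za-z0-9-]+", sentence):
        if token in gazetteer:
            tagged.append((token, "gazetteer"))  # known name: dictionary hit
        elif ALPHANUMERIC.match(token) or MIXED_CASE.match(token):
            tagged.append((token, "regex"))  # unknown name caught by a pattern
    return tagged


gazetteer = {"PTEN", "EGFR"}  # stand-in for the UMLS / LocusLink gazetteers
sentence = ("GBMs commonly overexpress the oncogenes EGFR and PDGFR, and contain "
            "mutations and deletions of tumor suppressor genes PTEN and TP53.")
tags = tag_candidates(sentence, gazetteer)
print(tags)
```

With this toy gazetteer, PDGFR is missed: it is all upper case with no digits, so neither pattern fires. Gaps of this kind are what the proper-noun and NP-chunking heuristics are meant to reduce.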

12
Complex Sentence Structures
  • Independent clauses with connectives
  • Many dependent clauses with one independent
    clause with / without connectives
  • Multiple agents and goals in a single clause

Gene14 binds to Gene15 in response to Gene16 or
methylmethanesulfonate; this interaction does
not require Gene17.
Gene57 is blocked by Gene61, which binds to
Gene62.
Gene96 or Gene97 competes with Gene98 for binding
to Gene99 and Gene100 or Gene101 stimulates
Gene102 in vitro in the absence of Gene105.
13
Complex Sentence Processing
  • Upon growth factor stimulation of quiescent
    cells, Gene100 declines late in Gene101 and
    Gene102 is replaced by Gene103, which is absent
    in quiescent cells.

Upon growth factor stimulation of quiescent
cells, Gene100 declines late in Gene101.
Gene102 is replaced by Gene103.
Gene103 is absent in quiescent cells.
14
Complex Sentence Processing
  • Verb-based approach.
  • Identify clauses in complex sentences using Link
    Grammar Linkages
  • Build simple clause sentences from them (one for
    each main verb) in the following clause format:
  • Subject | Verb | Object | Modifying phrase
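
One way to hold a clause in this four-slot format is a small record type. The `Clause` class and its field names are assumptions made for illustration; the slides do not show how IntEx represents clauses internally. The example values come from the HMBA sentence used in the Roles slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clause:
    """One simple clause: Subject, Verb, Object, Modifying phrase."""
    subject: str
    verb: str
    obj: Optional[str] = None       # may be absent, e.g. for intransitive verbs
    modifier: Optional[str] = None  # optional modifying phrase

    def as_sentence(self):
        """Join the filled slots back into a simple sentence."""
        parts = [self.subject, self.verb, self.obj, self.modifier]
        return " ".join(p for p in parts if p) + "."


clause = Clause(
    subject="HMBA",
    verb="could inhibit",
    obj="the MEC-1 cell proliferation",
    modifier="by down-regulation of PCNA expression",
)
print(clause.as_sentence())
```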

15
Link Grammar Parser (Sleator, D. and D. Temperley,
1993)
  • Sentence: The cat chased a snake
  • Link Grammar Representation


16
Interaction Extractor: Role Type Matching
Various syntactic roles (such as Subject, Object
and Modifying phrase) and their linguistically
significant combinations make up the roles.
17
Roles: Examples
HMBA could inhibit the MEC-1 cell proliferation
by down-regulation of PCNA expression.
  • Elementary (Subject): HMBA
  • Interaction (Verb): inhibit
  • Elementary (Object): the MEC-1 cell proliferation
  • Partial (Modifying Phrase): by down-regulation of
    PCNA expression
18
Interaction Extractor: Algorithm
(Flowchart; only its labels survive in this transcript)
  • Decision: Is the main verb an interaction (I)?
  • Role types combined: Elementary (G1), Elementary
    (G2), Partial (I, G2)
  • Output: Interaction (G1, I, G2)
  • Complete (G, I, G) → Interaction (G, I, G)
19
Interaction Extractor: Example
HMBA could inhibit the MEC-1 cell proliferation
by down-regulation of PCNA expression.
Extracted interactions (diagram labels: Main Verb,
Elementary, Elementary, Partial):
  • HMBA, inhibit, the MEC-1 cell proliferation
  • HMBA, down-regulation, PCNA expression
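
The role-type matching can be sketched as below. This is one plausible reading that reproduces the two interactions in the example. The flowchart's exact branches are not recoverable from the transcript, so the control flow here is an assumption. Role typing (deciding which phrase is Elementary or Partial) is assumed to be done upstream, and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Elementary:
    entity: str  # phrase containing a tagged gene/protein name (G)


@dataclass
class Partial:
    interaction: str  # interaction word (I)
    entity: str       # second participant (G2)


def extract_interactions(subject, main_verb, obj, modifier, interaction_words):
    """Combine typed roles of one clause into (G1, I, G2) triples.

    Assumed reading of the flowchart, not the authors' exact algorithm.
    """
    found = []
    if not isinstance(subject, Elementary):
        return found  # no G1 available in this sketch
    # Main verb is an interaction word and the object is Elementary.
    if main_verb in interaction_words and isinstance(obj, Elementary):
        found.append((subject.entity, main_verb, obj.entity))
    # A Partial role (I, G2) is completed with the Elementary subject.
    for role in (obj, modifier):
        if isinstance(role, Partial):
            found.append((subject.entity, role.interaction, role.entity))
    return found


interaction_words = {"inhibit", "down-regulation"}  # toy interaction word list
result = extract_interactions(
    subject=Elementary("HMBA"),
    main_verb="inhibit",
    obj=Elementary("the MEC-1 cell proliferation"),
    modifier=Partial("down-regulation", "PCNA expression"),
    interaction_words=interaction_words,
)
print(result)
```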
20
A Detailed Overall Example
21
Evaluation (Recall comparison with BioRAT)
Recall of IntEx and BioRAT from 229 abstracts when
compared with the DIP database. DIP (Database of
Interacting Proteins) is a database of proteins
that interact, and is curated from both abstracts
and full text.
22
Evaluation (Precision comparison with BioRAT)
Precision comparison of IntEx and BioRAT from
229 abstracts.
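
For reference, recall and precision over extracted interactions can be computed as in the sketch below. It assumes interactions are compared as unordered protein pairs against a reference set such as DIP; the transcript does not say how matches were actually counted. The gene pairs are toy placeholders, not the results from these slides.

```python
def normalize(pair):
    """Treat an interaction as an unordered protein pair (an assumption)."""
    return frozenset(pair)


def precision_recall(extracted, reference):
    """Precision = TP / extracted, recall = TP / reference, over unordered pairs."""
    extracted = {normalize(p) for p in extracted}
    reference = {normalize(p) for p in reference}
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy placeholder data, NOT the evaluation results reported in the slides.
extracted = [("Gene14", "Gene15"), ("Gene57", "Gene61"), ("Gene96", "Gene99")]
reference = [("Gene15", "Gene14"), ("Gene61", "Gene62")]
precision, recall = precision_recall(extracted, reference)
print(precision, recall)
```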
23
Error Analysis
24
Future Work in Interaction Extraction
  • Handling negations in sentences (such as "not
    interact", "fails to induce", "does not
    inhibit").
  • Extraction of detailed contextual attributes of
    interactions (such as biochemical context or
    location) by interpreting modifiers:
  • Location/Position modifiers (in, at, on, into,
    up, over)
  • Agent/Accompaniment modifiers (by, with)
  • Purpose modifiers (for)
  • Theme/Association modifiers (of)
  • Extraction of relationships between interactions
    across multiple sentences, within and across
    abstracts/full-text articles (protein
    interaction pathways).

25
A bigger future: combining automated extraction
with mass collaboration
  • Curation is expensive.
  • Automated extraction: miles to go
  • Vision: automated extraction with mass curation
  • The CBioC system: www.cbioc.org

26
Future Work: Visualization
27
Conclusion
  • Verb-based approach to extract protein-protein
    interactions
  • Handles complex sentences
  • Easy to scale up and to use in other domains
    (we are working on applying it to other domains
    too).
  • Protein name tagging needs improvement, and we
    are working on using other methods.
  • First release version is almost ready for both
    Windows and Linux platforms.

28
References
  • Link Grammar
  • http://www.link.cs.cmu.edu/link
  • LocusLink (Now Entrez Gene)
  • http://www.ncbi.nlm.nih.gov/LocusLink
  • UMLS
  • http://www.nlm.nih.gov/research/umls/umlsmain.html

29
References (cont.)
  • Blaschke, C., M. A. Andrade, et al. (1999).
    "Automatic extraction of biological information
    from scientific text: Protein-protein
    interactions." Proceedings of International
    Symposium on Molecular Biology: 60-67.
  • Corney, D. P. A., B. F. Buxton, et al. (2004).
    "BioRAT: extracting biological information from
    full-length papers." Bioinformatics 20(17):
    3206-3213.
  • Friedman, C., P. Kra, et al. (2001). "GENIES: a
    natural-language processing system for the
    extraction of molecular pathways from journal
    articles." Proceedings of the International
    Conference on Intelligent Systems for Molecular
    Biology: 574-82.
  • Rzhetsky, A., I. Iossifov, et al. (2004).
    "GeneWays: a system for extracting, analyzing,
    visualizing, and integrating molecular pathway
    data." J. of Biomedical Informatics 37(1):
    43-53.
  • Seymore, K., A. McCallum, et al. (1999). "Learning
    hidden Markov model structure for information
    extraction." AAAI 99 Workshop on Machine Learning
    for Information Extraction.
  • Sleator, D. and D. Temperley (1993). "Parsing
    English with a Link Grammar." Third International
    Workshop on Parsing Technologies.

30
Demo
31
  • Thank you!