Semantic empowerment of Life Science Applications October 2006 - PowerPoint PPT Presentation

1
Semantic empowerment of Life Science
Applications, October 2006
  • Amit Sheth
  • LSDIS Lab, Department of Computer Science,
  • University of Georgia

Acknowledgement: NCRR-funded Bioinformatics of
Glycan Expression, collaborators, partners at
CCRC (Dr. William S. York), and Satya S. Sahoo,
Cartic Ramakrishnan, Christopher Thomas, Cory
Henson.
2
Computation, data and semantics in life sciences
  • "The development of a predictive biology will
    likely be one of the major creative enterprises
    of the 21st century." Roger Brent, 1999
  • "The future will be the study of the genes and
    proteins of organisms in the context of their
    informational pathways or networks." L. Hood,
    2000
  • "Biological research is going to move from being
    hypothesis-driven to being data-driven." Robert
    Robbins
  • "We'll see over the next decade complete
    transformation (of life science industry) to very
    database-intensive as opposed to wet-lab
    intensive." Debra Goldfarb
  • We will show how semantics is a key enabler for
    achieving the above predictions and visions, in
    which information and process play a critical role.

3
Semantic Web and Life Science
  • Data captured per year: 1 exabyte (10^18 bytes) (Eric
    Neumann, Science, 2005)
  • How much is that?
  • Compare it to the estimate of the total words
    ever spoken by humans: 12 exabytes
  • Death by data
  • The need for
  • Search
  • Integration
  • Analysis, decision support
  • Discovery

Not data, but analysis and insight, leading to
decisions and discovery
4
Semantic empowerment of Life Science Applications
  • Life Science research today deals with highly
    heterogeneous as well as massive amounts of data
    distributed across the world.
  • We need more automated ways for integration and
    analysis leading to insight and discovery
  • - to understand cellular components, molecular
    functions and biological processes, and more
    importantly complex interactions and
    interdependencies between them.

5
Benefits of Semantics
  • Development of large domain-specific knowledge
  • for reference, common nomenclature, tagging
  • Integration of heterogeneous multi-source data:
    biomedical documents (text), scientific/experimental
    data and structured databases
  • Semantic search, browsing, integration, analysis,
    and discovery
  • Faster and more reliable discovery leading to
    quality of life improvements

6
What is semantics and the Semantic Web
  • Meaning and use of data
  • From syntax and structure to semantics (beyond
    formatting, organization, query interfaces, ...)
  • XML -> RDF -> OWL -> Rules -> Trust
  • Ontologies at the heart of Semantic Web,
    capturing agreement and domain knowledge
  • (Automatic) Semantic annotation, reasoning, ...
  • Also, increasing use of Service-Oriented
    Architecture -> Semantic Web services
  • W3C SW for Health Care and Life Sciences

7
Semantic empowerment of Life Science Applications
  • This talk will demonstrate some of the efforts
    in
  • Building large (populated) life science
    ontologies (GlycO, ProPreO)
  • Gathering/extracting knowledge and metadata
    entity and relationship extraction from
    unstructured data, automatic semantic annotation
    of scientific/experimental data (e.g., mass
    spectrometry)
  • Semantic web services and registries, leading to
    better discovery/reuse of scientific tools and
    their composition
  • Ontology-driven applications developed

8
Semantic Applications
  • Active Semantic Medical Records (demo): an
    operational health care application using
    multiple ontologies, semantic annotations and
    rule-based decision support
  • Semantic Browser (demo): contextual browsing of
    PubMed aided by ontology and schema (in future
    instance) level relationships
  • N-glycosylation process: an example of scientific
    workflow
  • Integrated Semantic Information and Knowledge
    System (ISIS): integrated access and analysis of
    structured databases, scientific literature and
    experimental data
  • Others we will not discuss: SemBowser, SemDrug,
    ...

Let us start with a couple of simple applications
9
Life Science Ontologies
  • GlycO
  • An ontology for structure and function of
    Glycopeptides
  • 573 classes, 113 relationships
  • Published through the National Center for
    Biomedical Ontology (NCBO)
  • ProPreO
  • An ontology for capturing process and lifecycle
    information related to proteomic experiments
  • 398 classes, 32 relationships
  • 3.1 million instances
  • Published through the National Center for
    Biomedical Ontology (NCBO) and Open Biomedical
    Ontologies (OBO)

10
N-Glycosylation metabolic pathway
GNT-I attaches GlcNAc at position 2
11
GlycO ontology
  • Challenge: model hundreds of thousands of
    complex carbohydrate entities
  • But the differences between the entities are
    small (e.g., just one component)
  • How to model all the concepts but preclude
    redundancy -> ensure maintainability, scalability

12
GlycoTree
N. Takahashi and K. Kato, Trends in Glycosciences
and Glycotechnology, 15: 235-251
13
EnzyO
  • The enzyme ontology EnzyO is highly intertwined
    with GlycO. While its structure is mostly that
    of a taxonomy, it is highly restricted at the
    class level and hence allows for comfortable
    classification of enzyme instances from multiple
    organisms
  • GlycO together with EnzyO contain all the
    information that is needed for the description of
    Metabolic pathways
  • e.g. N-Glycan Biosynthesis

14
Pathway representation in GlycO
Pathways do not need to be explicitly defined in
GlycO. The residue-, glycan-, enzyme- and
reaction descriptions contain all the knowledge
necessary to infer pathways.
15
Zooming in a little
The N-Glycan with KEGG ID 00015 is the substrate
to the reaction R05987, which is catalyzed by an
enzyme of the class EC 2.4.1.145.
The product of this reaction is the Glycan with
KEGG ID 00020.
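The idea that pathways can be inferred rather than stored can be sketched as follows. This is a minimal illustration, not GlycO's actual reasoning: it assumes each reaction is a simple record and that each substrate feeds at most one reaction. The first reaction is the one named on the slide; the second record (`R_HYPOTHETICAL`, `HYPOTHETICAL_GLYCAN`) is invented purely to show chaining.

```python
from collections import namedtuple

# One record per reaction description: substrate, product and catalyzing enzyme class.
Reaction = namedtuple("Reaction", "reaction_id substrate product enzyme_class")

REACTIONS = [
    # Taken from the slide: glycan 00015 -> R05987 (EC 2.4.1.145) -> glycan 00020.
    Reaction("R05987", "00015", "00020", "EC 2.4.1.145"),
    # Hypothetical follow-on reaction, present only to demonstrate chaining.
    Reaction("R_HYPOTHETICAL", "00020", "HYPOTHETICAL_GLYCAN", "EC (unspecified)"),
]


def infer_pathway(start, reactions):
    """Follow substrate -> product links from `start` until no reaction applies."""
    by_substrate = {r.substrate: r for r in reactions}  # assumes one reaction per substrate
    path, current, seen = [], start, set()
    while current in by_substrate and current not in seen:  # `seen` guards against cycles
        seen.add(current)
        step = by_substrate[current]
        path.append(step)
        current = step.product
    return path


if __name__ == "__main__":
    for step in infer_pathway("00015", REACTIONS):
        print(f"{step.substrate} --{step.reaction_id} / {step.enzyme_class}--> {step.product}")
```

No pathway object appears anywhere in the data; the ordered list of steps falls out of the reaction descriptions alone, which is the point the slide makes.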
16
GlycO population
  • Multiple data sources used in populating the
    ontology
  • KEGG - Kyoto Encyclopedia of Genes and Genomes
  • SWEETDB
  • CARBANK Database
  • Each data source has different schema for storing
    data
  • There is significant overlap of instances in the
    data sources
  • Hence, entity disambiguation and a common
    representational format are needed

17
Ontology population workflow
18
Ontology population workflow
Asn (41) b-D-GlcpNAc (41) b-D-GlcpNAc (41) b-D-Manp
(31) a-D-Manp (21) b-D-GlcpNAc (41) b-D-GlcpNAc
(61) a-D-Manp (21) b-D-GlcpNAc
19
Ontology population workflow
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
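A nested XML document of the kind shown above can be produced from a tree of residues with a few lines of code. The sketch below is only an illustration of that step, not the workflow's actual converter: the tuple layout `(link, anomer, monosaccharide, children)` and the function names are assumptions, while the element and attribute names are copied from the slide.

```python
import xml.etree.ElementTree as ET

# Residue tree as (link, anomer, monosaccharide, children); the layout is an assumption.
CORE = ("4", "b", "GlcNAc", [
    ("4", "b", "GlcNAc", [
        ("4", "b", "Man", [
            ("3", "a", "Man", [("2", "b", "GlcNAc", []), ("4", "b", "GlcNAc", [])]),
            ("6", "a", "Man", [("2", "b", "GlcNAc", [])]),
        ]),
    ]),
])


def add_residue(parent, residue):
    """Append one <residue> element under `parent`, then recurse into its children."""
    link, anomer, sugar, children = residue
    node = ET.SubElement(parent, "residue", {
        "link": link,
        "anomeric_carbon": "1",  # every residue on the slide uses anomeric carbon 1
        "anomer": anomer,
        "chirality": "D",        # every residue on the slide is D
        "monosaccharide": sugar,
    })
    for child in children:
        add_residue(node, child)
    return node


def build_glycan(aglycon, root_residue):
    """Build the <Glycan> element with its aglycon and nested residues."""
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", {"name": aglycon})
    add_residue(glycan, root_residue)
    return glycan


if __name__ == "__main__":
    print(ET.tostring(build_glycan("Asn", CORE), encoding="unicode"))
```

Because branching is carried by element nesting, two glycans that differ in a single residue differ in exactly one subtree, which is what makes comparison and disambiguation across data sources tractable.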
20
Ontology population workflow
21
ProPreO ontology
  • Two aspects of glycoproteomics
  • What is it? -> identification
  • How much of it is there? -> quantification
  • Heterogeneity in data generation process,
    instrumental parameters, formats
  • Need data and process provenance ->
    ontology-mediated provenance
  • Hence, ProPreO models both the glycoproteomics
    experimental process and attendant data

22
ProPreO population: transformation to RDF
Scientific Data
Computational Methods
Ontology instances
23
ProPreO population: transformation to RDF
[Diagram: Scientific Data -> Computational Methods -> RDF]
  • Key: amino-acid sequence
  • Protein path: Protein Data (amino-acid sequence)
  • Peptide path: Extract Peptide Amino-acid Sequence
    from Protein Amino-acid Sequence
  • Computational methods: Calculate Chemical Mass,
    Calculate Monoisotopic Mass, Determine
    N-glycosylation Consensus
  • Intermediate RDF: Chemical Mass RDF, Monoisotopic
    Mass RDF, Amino-acid Sequence RDF
  • Protein RDF and Peptide RDF properties: chemical
    mass, monoisotopic mass, amino-acid sequence,
    N-glycosylation consensus, parent protein
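Two of the computational methods named in this diagram can be sketched briefly. This is a hedged illustration rather than the ProPreO pipeline itself: it assumes the "N-glycosylation consensus" is the standard sequon N-X-S/T with X not proline, and it uses standard monoisotopic residue masses; the function names are made up for the example. Chemical (average) mass is not covered.

```python
import re

WATER = 18.01056  # monoisotopic mass of H2O, added once per peptide

# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def monoisotopic_mass(peptide):
    """Sum the residue masses and add one water for the free termini."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def n_glycosylation_sites(sequence):
    """Return 0-based positions of N in every N-X-S/T sequon where X is not P."""
    # The lookahead lets overlapping sequons be reported as well.
    return [m.start() for m in re.finditer(r"(?=N[^P][ST])", sequence)]


if __name__ == "__main__":
    peptide = "ANGTNPSNVS"
    print(n_glycosylation_sites(peptide))
    print(round(monoisotopic_mass(peptide), 4))
```

Each computed value would then be attached to the protein or peptide instance as an RDF property, as the diagram indicates.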
24
Semantic empowerment of Life Science Applications
  • This talk will demonstrate some of the efforts
    in
  • building large life science ontologies (GlycO, an
    ontology for structure and function of
    Glycopeptides, and ProPreO, an ontology for
    capturing process and lifecycle information
    related to proteomic experiments) and their
    application in advanced ontology-driven semantic
    applications
  • entity and relationship extraction from
    unstructured data, automatic semantic annotation
    of scientific/experimental data (e.g., mass
    spectrometry), and resulting capability in
    integrated access and analysis of structured
    databases, scientific literature and experimental
    data
  • semantic web services and registries, leading to
    better discovery/reuse of scientific tools and
    composition of scientific workflows that process
    high-throughput data and can be adaptive
  • semantic applications developed

25
Relationship extraction from unstructured
data (other related research: biological entity
extraction)
26
Overview
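[Diagram labels]
  • UMLS (schema level): Biologically active substance
    -- affects / complicates / causes --> Disease or
    Syndrome
  • MeSH (instance level): Fish Oils (instance_of
    Biologically active substance) -- ??????? -->
    Raynaud's Disease (instance_of Disease or Syndrome)
  • PubMed: 9284 documents, 4733 documents, 5 documents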
27
About the data used
  • UMLS: a high-level schema of the biomedical
    domain
  • 136 classes and 49 relationships
  • Synonyms of all relationships obtained using
    variant lookup (tools from NLM)
  • MeSH
  • Terms already asserted as instance of one or more
    classes in UMLS
  • PubMed
  • Abstracts annotated with one or more MeSH terms

T147: effect, induce, etiology, cause, effecting,
induced
28
Example PubMed abstract (for the domain expert)
29
Method: Parse sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ
endogenous) (CC or) (JJ exogenous) ) (NN
stimulation) ) (PP (IN by) (NP (NN estrogen) ) )
) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN
hyperplasia) ) (PP (IN of) (NP (DT the) (NN
endometrium) ) ) ) ) ) )
30
Method: Identify entities and relationships in
parse tree
31
Method: Identify entities and relationships in
parse tree
32
Method: Fact extraction from parse tree
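The step from a parse tree to a candidate fact can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method used in the talk: it handles only a top-level sentence of the form NP VP, takes the whole noun phrases as entities, and matches the verb against the T147 synonym list from the earlier slide by crude prefix comparison. All function names are invented for the example.

```python
import re

# Parse tree for the example sentence, copied from the slide.
PARSE = ("(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
         "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
         "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
         "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )")

# Relationship synonyms as listed on the slide for T147.
RELATION_SYNONYMS = {"T147": ["effect", "induce", "etiology", "cause", "effecting", "induced"]}


def parse_tree(text):
    """Turn a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # constituent or part-of-speech label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def words(node):
    """Collect the leaf words under a node, left to right."""
    out = []
    for child in node[1:]:
        out.extend(words(child) if isinstance(child, list) else [child])
    return out


def extract_fact(tree):
    """Return (subject phrase, verb, object phrase) for an S -> NP VP sentence."""
    s = tree[1] if tree[0] == "TOP" else tree
    subject = next(c for c in s[1:] if c[0] == "NP")
    vp = next(c for c in s[1:] if c[0] == "VP")
    verb = next(c for c in vp[1:] if c[0].startswith("VB"))
    obj = next(c for c in vp[1:] if c[0] == "NP")
    return " ".join(words(subject)), verb[1], " ".join(words(obj))


def relation_id(verb):
    """Map a verb to a relationship id by prefix match against the synonym lists."""
    for rel, synonyms in RELATION_SYNONYMS.items():
        if any(verb.lower().startswith(s) for s in synonyms):
            return rel
    return None


if __name__ == "__main__":
    subject, verb, obj = extract_fact(parse_tree(PARSE))
    print(subject, "|", verb, relation_id(verb), "|", obj)
```

A fuller treatment would narrow each noun phrase to the MeSH-annotated entity inside it (here, estrogen and hyperplasia) before asserting the relationship.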
33
Semantic annotation of scientific/experimental
data
34
ProPreO: ontology-mediated provenance
Mass Spectrometry (MS) Data: ms/ms peaklist data

parent ion m/z   parent ion abundance   parent ion charge
830.9570         194.9604               2

fragment ion m/z   fragment ion abundance
580.2985           0.3592
688.3214           0.2526
779.4759           38.4939
784.3607           21.7736
1543.7476          1.3822
1544.7595          2.9977
1562.8113          37.4790
1660.7776          476.5043
35
ProPreO: ontology-mediated provenance
Semantically Annotated MS Data (ontological concepts)
<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>
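The move from a bare peak list to annotated XML can be sketched as below. This is an assumed reading of the slide, not the actual annotation tool: it takes the first three numbers as parent ion m/z, abundance and charge, and every following pair as fragment ion m/z and abundance. Element and attribute names follow the slide; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Raw ms/ms peak list as it appears on the previous slide.
RAW = """830.9570 194.9604 2 580.2985 0.3592 688.3214 0.2526 779.4759
38.4939 784.3607 21.7736 1543.7476 1.3822 1544.7595 2.9977 1562.8113
37.4790 1660.7776 476.5043"""

INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"


def annotate_peak_list(raw, instrument, mode="ms-ms"):
    """Wrap each number in an element whose name states what the number means."""
    parent_mz, parent_abundance, charge, *fragments = raw.split()
    root = ET.Element("ms-ms_peak_list")
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": mode})
    ET.SubElement(root, "parent_ion",
                  {"m-z": parent_mz, "abundance": parent_abundance, "z": charge})
    # Remaining values alternate: fragment m/z, fragment abundance.
    for mz, abundance in zip(fragments[0::2], fragments[1::2]):
        ET.SubElement(root, "fragment_ion", {"m-z": mz, "abundance": abundance})
    return root


if __name__ == "__main__":
    print(ET.tostring(annotate_peak_list(RAW, INSTRUMENT), encoding="unicode"))
```

Once the instrument and mode travel with the data, a later query can select or compare result files by how they were produced, which is the provenance use the slide title refers to.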
37
N-Glycosylation Process (NGP)
38
Semantic Web Process to incorporate provenance
Semantic Annotation Applications
39
Converting biological information to the W3C
Resource Description Framework (RDF): experience
with Entrez Gene
  • Collaboration with Dr. Olivier Bodenreider
  • (US National Library of Medicine, NIH, Bethesda,
    MD)

40
Biomedical Knowledge Repository
Entrez
41
Implementation
Entrez Gene
Entrez Gene XML
XSLT
Entrez Gene RDF graph
Entrez Gene RDF
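The pipeline above uses XSLT for the XML-to-RDF step. As a stand-in that needs only the Python standard library, the sketch below walks a small XML record and emits subject-predicate-object triples. The sample record is a heavily simplified, assumed shape rather than real Entrez Gene output, and the `eg:` prefix is written with a colon on the assumption that this is how the predicates on later slides were originally spelled.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative record; the real Entrez Gene XML is far larger.
SAMPLE = """<Entrezgene>
  <Gene-track_geneid>351</Gene-track_geneid>
  <Prot-ref_name>
    <Prot-ref_name_E>amyloid beta A4 protein</Prot-ref_name_E>
    <Prot-ref_name_E>protease nexin-II</Prot-ref_name_E>
  </Prot-ref_name>
</Entrezgene>"""

NAME_PREDICATE = "eg:has_protein_reference_name_E"


def gene_to_triples(xml_text):
    """Emit one (subject, predicate, object) triple per protein reference name."""
    root = ET.fromstring(xml_text)
    gene_id = root.findtext(".//Gene-track_geneid")
    subject = f"eg:Gene-track_geneid/{gene_id}"
    return [(subject, NAME_PREDICATE, name.text) for name in root.iter("Prot-ref_name_E")]


if __name__ == "__main__":
    for triple in gene_to_triples(SAMPLE):
        print(triple)
```

An XSLT stylesheet expresses the same mapping declaratively, one template per element, which is easier to maintain when the source schema has hundreds of element types.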
42
Web interface
ENTREZ GENE
ENTREZ GENE XML
XSLT
ENTREZ GENE RDF GRAPH
ENTREZ GENE RDF
44
XML
46
RDF Graph
Graph labels: APP (geneid-351);
eg:has_protein_reference_name_E; Alzheimer's Disease
Legend: subject, predicate, object
47
RDF Graph
Entrez Gene RDF graph (W3C Validator Site -
http://www.w3.org/RDF/Validator/)
49
RDF
51
Connecting different genes
Human APP gene is implicated in Alzheimer's
disease. Which genes are functionally homologous
to this gene?
[Graph: gene nodes linked to protein name nodes by
eg:has_protein_reference_name_E]
  • Genes: APP gene Homo sapiens, APP gene Gallus
    gallus, APP gene Canis familiaris
  • Protein reference names: amyloid beta (A4)
    precursor protein (protease nexin-II, Alzheimer
    disease); amyloid beta A4 protein; amyloid-beta
    protein; beta-amyloid peptide; A4 amyloid protein;
    amyloid protein; protease nexin-II; cerebral
    vascular amyloid peptide
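The question on this slide can be answered over the RDF graph by looking for genes that share a protein reference name. The sketch below shows that lookup on plain tuples. Which name belongs to which gene is not recoverable from the slide, so the assignments in `TRIPLES` are illustrative assumptions, as are the function name and the `eg:` colon.

```python
NAME_PREDICATE = "eg:has_protein_reference_name_E"

# Illustrative triples only; the gene-to-name assignments are assumed.
TRIPLES = [
    ("APP gene Homo sapiens", NAME_PREDICATE, "amyloid beta A4 protein"),
    ("APP gene Homo sapiens", NAME_PREDICATE, "protease nexin-II"),
    ("APP gene Gallus gallus", NAME_PREDICATE, "amyloid beta A4 protein"),
    ("APP gene Canis familiaris", NAME_PREDICATE, "amyloid beta A4 protein"),
]


def genes_sharing_names(gene, triples):
    """Return every other gene that has at least one protein reference name in common."""
    names = {o for s, p, o in triples if s == gene and p == NAME_PREDICATE}
    return {s for s, p, o in triples
            if p == NAME_PREDICATE and o in names and s != gene}


if __name__ == "__main__":
    print(sorted(genes_sharing_names("APP gene Homo sapiens", TRIPLES)))
```

A shared name is only a hint of functional homology; in practice the candidates would be confirmed against curated homology data.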
52
Inference
  • Rules are objects that allow inference from RDF
    data [1]
  • Oracle 10g allows the creation of a rulebase based
    on RDFS (RDF Schema)

Graph labels:
  • eg:Gene-track_geneid/351
  • eg:has_protein_reference_name_E
  • amyloid beta (A4) precursor protein (protease
    nexin-II, Alzheimer disease)
  • eg:is_associated_with
  • eg:Neurodegenerative Diseases
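What a rule adds to the graph can be shown with a single forward-chaining step. The rule below is a hypothetical example, not the one from the talk and not Oracle's rulebase syntax: if a gene has a protein reference name that mentions a given keyword, assert that the gene is associated with a given disease class. The `eg:` colon and the function name are assumptions.

```python
NAME_PREDICATE = "eg:has_protein_reference_name_E"
ASSOCIATED_WITH = "eg:is_associated_with"

# Asserted fact, using the labels shown on the slide.
FACTS = {
    ("eg:Gene-track_geneid/351", NAME_PREDICATE,
     "amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)"),
}


def apply_rule(triples, keyword, target):
    """Add (gene, is_associated_with, target) for every gene whose name mentions keyword."""
    inferred = {
        (s, ASSOCIATED_WITH, target)
        for s, p, o in triples
        if p == NAME_PREDICATE and keyword.lower() in o.lower()
    }
    return set(triples) | inferred  # asserted plus inferred triples


if __name__ == "__main__":
    graph = apply_rule(FACTS, "Alzheimer disease", "eg:Neurodegenerative Diseases")
    for triple in sorted(graph):
        print(triple)
```

The inferred triple is never stored in the source data; it exists only because the rule was applied, which is what lets a query about neurodegenerative diseases reach gene 351.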
53
Integrated Semantic Information and Knowledge
System (ISIS)
User questions: "Have I performed an error? Is the
result erroneous? Give me all result files from a
similar organism, cell, preparation, mass
spectrometric conditions and compare results."
  • SPARQL query-based user interface
  • ProPreO ontology
  • Experimental data semantic annotation, metadata
    file, semantic metadata registry
  • PROTEOMECOMMONS: experimental data
  • Proteomics workflow: Raw -> Raw2mzXML -> mzXML ->
    mzXML2Pkl -> Pkl -> Pkl2pSplit -> pSplit -> MASCOT
    Search -> MASCOT result -> ProVault -> ProVault
    result
54
Summary, Observations, Conclusions
  • We now have semantics- and services-enabled
    approaches that support semantic search, semantic
    integration, semantic analytics, decision support
    and validation (e.g., error prevention in
    healthcare), knowledge discovery, process/pathway
    discovery, ...

55
  • http://lsdis.cs.uga.edu
  • http://knoesis.org
  • http://lsdis.cs.uga.edu/projects/asdoc/
  • http://lsdis.cs.uga.edu/projects/glycomics/