Medical Informatics - PowerPoint PPT Presentation

About This Presentation
Title:

Medical Informatics

Slides: 48
Provided by: uce61
Learn more at: https://homepages.uc.edu

Transcript and Presenter's Notes

Title: Medical Informatics


1
Medical Informatics
Genomics and Bioinformatics
2
Identification and Prioritization of Novel
Disease Candidate Genes Systems Biology Based
Integrative Approaches
Bioinformatics to Systems Biology November 16,
2007
  • Anil Jegga
  • Division of Biomedical Informatics,
  • Cincinnati Children's Hospital Medical Center
    (CCHMC)
  • Department of Pediatrics, University of
    Cincinnati
  • Cincinnati, Ohio - 45229
  • Anil.Jegga_at_cchmc.org
  • http://anil.cchmc.org

3
Acknowledgements
  • Jing Chen
  • Eric Bardes
  • Bruce Aronow
  • All the publicly available gene annotation
    resources especially NCBI, MGI and UCSC

Support
4
Two Separate Worlds..
Medical Informatics
Bioinformatics: the "omes"
PubMed
Proteome
Disease Database
Patient Records
OMIM Clinical Synopsis
Clinical Trials
382 "omes" so far, and there is an UNKNOME too:
genes with no known function.
http://omics.org/index.php/Alphabetically_ordered_list_of_omics
(as of November 15, 2007)
With Some Data Exchange
5
PubMed
miRNAome
Pharmacogenome
OMIM
6
No Integrative Genomics is Complete without
Ontologies
  • Gene Ontology (GO)
  • Unified Medical Language System (UMLS)

7
The 3 Gene Ontologies
  • Molecular Function = elemental activity/task
  • the tasks performed by individual gene products;
    examples are carbohydrate binding and ATPase
    activity
  • What a product does, precise activity
  • Biological Process = biological goal or objective
  • broad biological goals, such as DNA repair or
    purine metabolism, that are accomplished by
    ordered assemblies of molecular functions
  • Biological objective, accomplished via one or
    more ordered assemblies of functions
  • Cellular Component = location or complex
  • subcellular structures, locations, and
    macromolecular complexes; examples include
    nucleus, telomere, and RNA polymerase II
    holoenzyme
  • is located in (is a subcomponent of)

http://www.geneontology.org
8
Example gene product: hammer
Function (what)                 Process (why)
Drive a nail into wood          Carpentry
Drive a stake into soil         Gardening
Smash a bug                     Pest control
A performer's juggling object   Entertainment
http://www.geneontology.org
9
Unified Medical Language System Knowledge Server
(UMLSKS)
  • The UMLS Metathesaurus contains information about
    biomedical concepts and terms from many
    controlled vocabularies and classifications used
    in patient records, administrative health data,
    bibliographic and full-text databases, and expert
    systems.
  • The Semantic Network, through its semantic types,
    provides a consistent categorization of all
    concepts represented in the UMLS Metathesaurus.
    The links between the semantic types provide the
    structure for the Network and represent important
    relationships in the biomedical domain.
  • The SPECIALIST Lexicon is an English language
    lexicon with many biomedical terms, containing
    syntactic, morphological, and orthographic
    information for each term or word.

http://umlsks.nlm.nih.gov/kss
10
Unified Medical Language System: Metathesaurus
  • Over 1 million biomedical concepts
  • About 5 million concept names from more than 100
    controlled vocabularies and classifications (some
    in multiple languages) used in patient records,
    administrative health data, bibliographic and
    full-text databases and expert systems.
  • The Metathesaurus is organized by concept or
    meaning. Alternate names for the same concept
    (synonyms, lexical variants, and translations)
    are linked together.
  • Each Metathesaurus concept has attributes that
    help to define its meaning, e.g., the semantic
    type(s) or categories to which it belongs, its
    position in the hierarchical contexts from
    various source vocabularies, and, for many
    concepts, a definition.
  • Customizable: users can exclude vocabularies that
    are not relevant for specific purposes or not
    licensed for use in their institutions.
    MetamorphoSys, the multi-platform Java install
    and customization program distributed with the
    UMLS resources, helps users to generate
    pre-defined or custom subsets of the
    Metathesaurus.
  • Uses
  • linking between different clinical or biomedical
    vocabularies
  • information retrieval from databases with human
    assigned subject index terms and from free-text
    information sources
  • linking patient records to related information in
    bibliographic, full-text, or factual databases
  • natural language processing and automated
    indexing research

11
Open biomedical ontologies
http://obo.sourceforge.net/
12
Mammalian Phenotype Ontology
  1. The Mammalian Phenotype (MP) Ontology enables
    robust annotation of mammalian phenotypes in the
    context of mutations, quantitative trait loci and
    strains that are used as models of human biology
    and disease.
  2. Each node in MPO represents a category of
    phenotypes and each MP ontology term has a unique
    identifier, a definition, synonyms, and is
    associated with gene variants causing these
    phenotypes in genetically engineered or
    mutagenesis experiments.
  3. In the current version of MPO, there are >4250
    terms associated with >4300 unique Entrez mouse
    genes (extrapolated to 4300 orthologous human
    genes).

http://www.informatics.jax.org
13
Disease Gene Identification and Prioritization
Hypothesis: The majority of genes that impact or
cause disease share membership in any of several
functional relationships, OR functionally similar
or related genes cause similar phenotypes.
  • Functional similarity: common/shared
  • Gene Ontology term
  • Pathway
  • Phenotype
  • Chromosomal location
  • Expression
  • Cis regulatory elements (Transcription factor
    binding sites)
  • miRNA regulators
  • Interactions
  • Other features..

14
Background, Problems & Issues
  1. Most of the common diseases are multi-factorial
    and modified by genetically and mechanistically
    complex polygenic interactions and environmental
    factors.
  2. High-throughput genome-wide studies like linkage
    analysis and gene expression profiling tend to
    be most useful for classification and
    characterization but do not provide sufficient
    information to identify or prioritize specific
    disease causal genes.

15
Background, Problems & Issues
  1. Since multiple genes are associated with the same or
    similar disease phenotypes, it is reasonable to
    expect the underlying genes to be functionally
    related.
  2. Such functional relatedness (common pathway,
    interaction, biological process, etc.) can be
    exploited to aid in the finding of novel disease
    genes. For example, genetically heterogeneous
    hereditary diseases such as Hermansky-Pudlak
    syndrome and Fanconi anaemia have been shown to
    be caused by mutations in different interacting
    proteins.

16
Background, Problems & Issues
  • Disease candidate gene studies

Biological experiments (expensive, time-consuming)
17
Background, Problems & Issues
Current candidate gene prioritization tools
  • Assumption: genes involved in the same complex
    disease will have similar functions

Example: dilated cardiomyopathy
Approach with training
Training: known disease genes (10 from OMIM)
Test: 68 genes at 10q25-26
Score the test genes based on their similarity to
the training set
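
The training/test scoring idea on this slide can be sketched in a few lines. This is only an illustration, not the ToppGene scoring method: the gene names and annotation terms are hypothetical placeholders, and mean Jaccard overlap is an assumed stand-in for whatever similarity measure a real tool uses.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of annotation terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(training, candidates):
    """Rank candidate genes by mean Jaccard similarity to the training genes."""
    scores = {}
    for gene, terms in candidates.items():
        sims = [jaccard(terms, t) for t in training.values()]
        scores[gene] = sum(sims) / len(sims) if sims else 0.0
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical annotations (placeholders, not real genes or ontology terms).
training = {
    "train_gene_1": {"term_a", "term_b", "term_c"},
    "train_gene_2": {"term_b", "term_c", "term_d"},
}
candidates = {
    "cand_x": {"term_b", "term_c"},
    "cand_y": {"term_d", "term_e"},
    "cand_z": {"term_f"},
}

ranking = rank_candidates(training, candidates)
for position, (gene, score) in enumerate(ranking, start=1):
    print(position, gene, round(score, 3))
```

A candidate that shares more annotation terms with the known disease genes ends up nearer the top of the list, which is the behaviour the slide describes.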
18
TOPPGene: Transcriptome Ontology Pathway based
Prioritization of Genes
http://toppgene.cchmc.org
Chen J, Xu H, Aronow BJ, Jegga AG. 2007. Improved
human disease candidate gene prioritization using
mouse phenotype. BMC Bioinformatics 8(1):392
[Epub ahead of print]
  • Applications
  • For functional enrichment
  • For candidate gene prioritization

Why another gene prioritization method?
19
Comparison with other related approaches
Feature type POCUS Prospectr SUSPECTS ENDEAVOUR ToppGene
Year 2003 2005 2006 2006 2007
Sequence Features
GO Annotations
Transcript Features
Protein Features
Literature
Phenotype Annotations
Training set
20
Comparison with other related approaches: Feature
Details
Feature type POCUS Prospectr SUSPECTS ENDEAVOUR ToppGene
Year 2003 2005 2006 2006 2007
Sequence Features Annotations Gene length Homology Base composition Gene length Homology Base composition Blast cis-element Cytoband cis-element miRNA targets GeneSets
Gene Annotations Gene Ontology Gene Ontology Gene Ontology Gene Ontology Mouse Phenotype
Transcript Features Gene expression Gene expression EST expression Gene expression
Protein Features domains Protein domains domains interactions pathways domains interactions pathways
Literature Keywords Co-citation
Training set No No Yes Yes Yes
21
Mammalian Phenotype Ontology
We do not check whether the human orthologous
gene of a mouse gene causes a similar phenotype.
Rather, we assume that orthologous genes cause
orthologous phenotypes and test the potential of
the extrapolated mouse phenotype terms as a
similarity measure to prioritize human disease
candidate genes.
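
The extrapolation step described above can be illustrated with a minimal sketch. The gene symbols, phenotype term labels, and ortholog map below are hypothetical placeholders, not MGI data, and the function name is an assumption made for illustration.

```python
def extrapolate_phenotypes(mouse_phenotypes, mouse_to_human):
    """Copy mouse phenotype terms onto human orthologs.

    mouse_phenotypes: mouse gene -> set of phenotype terms
    mouse_to_human:   mouse gene -> list of human ortholog symbols
    Mouse genes without a known human ortholog are skipped.
    """
    human = {}
    for mouse_gene, terms in mouse_phenotypes.items():
        for human_gene in mouse_to_human.get(mouse_gene, ()):
            human.setdefault(human_gene, set()).update(terms)
    return human


# Hypothetical inputs (placeholders only).
mouse_phenotypes = {
    "mouse_gene_1": {"mp_term_a", "mp_term_b"},
    "mouse_gene_2": {"mp_term_c"},
    "mouse_gene_3": {"mp_term_d"},  # no ortholog listed below
}
mouse_to_human = {
    "mouse_gene_1": ["HUMAN_GENE_1"],
    "mouse_gene_2": ["HUMAN_GENE_1", "HUMAN_GENE_2"],
}

human_annotations = extrapolate_phenotypes(mouse_phenotypes, mouse_to_human)
print(human_annotations)
```

The sketch makes the stated assumption explicit: phenotype terms are transferred purely through orthology, with no check that the human gene actually produces a similar phenotype.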
22
Mammalian Phenotype Ontology
23
ToppGene General Schema
24
TOPPGene - Data Sources
  • Gene Ontology: GO and NCBI Entrez Gene
  • Mouse Phenotype: MGI (used for the first time for
    human disease gene prioritization)
  • Pathways: KEGG, BioCarta, BioCyc, Reactome,
    GenMAPP, MSigDB
  • Domains: UniProt (Pfam, InterPro, etc.)
  • Interactions: NCBI Entrez Gene (BioGRID,
    Reactome, BIND, HPRD, etc.)
  • PubMed IDs: NCBI Entrez Gene
  • Expression: GEO
  • Cytoband: MSigDB
  • Cis-Elements: MSigDB
  • miRNA Targets: MSigDB

New features added
25
TOPPGene - Validation
  • Random-gene cross-validation
  • Disease-gene relations from OMIM and GAD
    databases
  • Training set: disease genes with one gene
    (target) removed
  • Test set: 100 genes (target gene + 99 random
    genes)
  • Rank of target gene
  • Control: random training sets
  • AUC and Sensitivity/Specificity
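
The validation loop listed above can be sketched as follows. This is a simplified illustration under stated assumptions: the gene names and annotation terms are hypothetical, and `mean_overlap` (mean Jaccard overlap) is an assumed placeholder score, not the scoring used by ToppGene.

```python
import random


def mean_overlap(gene, training, annotations):
    """Mean Jaccard overlap between a gene's terms and each training gene's terms."""
    terms = annotations[gene]
    sims = []
    for other in training:
        union = terms | annotations[other]
        sims.append(len(terms & annotations[other]) / len(union) if union else 0.0)
    return sum(sims) / len(sims) if sims else 0.0


def leave_one_out_ranks(disease_genes, background, annotations, n_random=99, seed=0):
    """For each disease gene: hold it out, mix it with random genes, record its rank."""
    rng = random.Random(seed)
    pool = sorted(set(background) - set(disease_genes))
    ranks = []
    for target in disease_genes:
        training = [g for g in disease_genes if g != target]
        test_set = rng.sample(pool, n_random) + [target]
        scores = {g: mean_overlap(g, training, annotations) for g in test_set}
        ordered = sorted(test_set, key=lambda g: (-scores[g], g))
        ranks.append(ordered.index(target) + 1)  # 1 = best rank
    return ranks


# Hypothetical toy data: disease genes share one term, background genes share none.
disease_genes = ["dis_1", "dis_2", "dis_3"]
background = [f"bg_{i}" for i in range(200)]
annotations = {g: {"shared_term", f"own_{g}"} for g in disease_genes}
annotations.update({g: {f"own_{g}"} for g in background})

ranks = leave_one_out_ranks(disease_genes, background, annotations)
print(ranks)
```

In this toy setting every held-out gene ranks first because only disease genes share a term. With real annotations the ranks spread out, and that spread is what the AUC and sensitivity/specificity figures summarize.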

26
TOPPGene - Validation
  • Random-gene cross-validation: breast cancer
    example

27
  • Random-gene cross-validation result
  • Training: 19 diseases with 693 genes
  • Control: 20 random sets of 35 genes each
  • Sensitivity/Specificity: 77%/90%
  • AUC: 0.916
  • Sensitivity: frequency of target genes that are
    ranked above a particular threshold position
  • Specificity: the percentage of genes ranked below
    the threshold
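
The two definitions above translate directly into code. The sketch below follows the slide's wording literally (specificity is computed over all genes in the test list) and derives a rank-based AUC with the trapezoid rule. The ranks used are made-up numbers for illustration, not results from the study.

```python
def sensitivity(ranks, threshold):
    """Fraction of target genes ranked at or above the threshold position."""
    return sum(r <= threshold for r in ranks) / len(ranks)


def specificity(list_size, threshold):
    """Fraction of genes in a test list ranked below the threshold position."""
    return (list_size - threshold) / list_size


def rank_auc(ranks, list_size):
    """Area under the sensitivity vs (1 - specificity) curve, trapezoid rule."""
    xs = [t / list_size for t in range(list_size + 1)]          # 1 - specificity
    ys = [sensitivity(ranks, t) for t in range(list_size + 1)]  # sensitivity
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(list_size)
    )


# Made-up ranks of four held-out target genes in test lists of 10 genes.
example_ranks = [1, 2, 5, 10]
print(sensitivity(example_ranks, 2), specificity(10, 2), rank_auc(example_ranks, 10))
```

An AUC near 1 means targets are consistently ranked near the top; 0.5 corresponds to random ordering.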

28
Using Mouse Phenotype as a feature of similarity
measure improves human disease gene prioritization
  • Random-gene cross-validation with only one feature

29
Using Mouse Phenotype as a feature of similarity
measure improves human disease gene prioritization
Random-gene cross-validation by leaving one
feature out
Overall performance: all features 0.913; all
except MP 0.893; all except MP and PubMed 0.888
Sensitivity: true positive rate at a cutoff score
Specificity: true negative rate at the same cutoff
30
  • Locus-region cross-validation using different
    feature sets

Features              Average rank ratio   Times target genes   Times target genes
                      of target genes      ranked in top 5      ranked in top 10
All                   7.39                 118                  125
GO, MP, PubMed        7.50                 118                  126
MP, PubMed            7.08                 121                  126
Without GO            6.84                 117                  123
Without Pathway       7.66                 118                  124
Without Domain        6.71                 118                  124
Without Interaction   7.17                 120                  124
Without Expression    7.28                 118                  128
Without MP            9.77                 110                  117
Without PubMed        9.91                 100                  111
Without MP, PubMed    22.61                71                   80
31
  • ToppGene web server (http://toppgene.cchmc.org)
  • For functional enrichment analysis

32
  • ToppGene web server (http://toppgene.cchmc.org)
  • For functional enrichment analysis

33
  • ToppGene web server (http://toppgene.cchmc.org)
  • For functional enrichment analysis

34
  • ToppGene web server (http://toppgene.cchmc.org)
  • For functional enrichment analysis

35
PPI - Predicting Disease Genes
  1. Direct protein-protein interactions (PPI) are one
    of the strongest manifestations of a functional
    relation between genes.
  2. Hypothesis: Interacting proteins lead to the same or
    similar disease phenotypes when mutated.
  3. Several genetically heterogeneous hereditary
    diseases have been shown to be caused by mutations in
    different interacting proteins, for example
    Hermansky-Pudlak syndrome and Fanconi anaemia.
    Hence, protein-protein interactions might in
    principle be used to identify potentially
    interesting disease gene candidates.

36
  • Prioritize candidate genes in the interacting
    partners of the disease-related genes
  • Training sets: disease-related genes
  • Test sets: interacting partners of the training
    genes
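
Building the test set from interacting partners amounts to a breadth-first expansion of an interaction graph. The sketch below is illustrative only: the interaction pairs and gene names are hypothetical, and the level numbering (0 = known disease genes, 1 = direct partners, 2 = partners of partners) is an assumed convention for this example.

```python
from collections import defaultdict


def build_graph(pairs):
    """Build an undirected interaction graph from (gene, gene) pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def interaction_levels(graph, seeds, max_level=2):
    """Return {level: genes first reached at that distance from the seed genes}."""
    levels = {0: set(seeds)}
    seen = set(seeds)
    frontier = set(seeds)
    for level in range(1, max_level + 1):
        reached = set()
        for gene in frontier:
            reached |= graph.get(gene, set())
        reached -= seen  # keep only genes not assigned to an earlier level
        levels[level] = reached
        seen |= reached
        frontier = reached
    return levels


# Hypothetical interaction pairs (placeholders, not real interactome data).
pairs = [
    ("seed_a", "p1"), ("seed_a", "p2"), ("seed_b", "p2"),
    ("p1", "q1"), ("p2", "q2"), ("q2", "r1"),
]
levels = interaction_levels(build_graph(pairs), ["seed_a", "seed_b"])
print({level: sorted(genes) for level, genes in levels.items()})
```

Each added level can enlarge the candidate set sharply, which is why the candidates need to be prioritized rather than simply listed.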

37
  • Example: breast cancer

OMIM genes (level 0)   Directly interacting genes (level 1)   Indirectly interacting genes (level 2)
15                     342                                    2469!
38
  • ToppGene web server (http://toppgene.cchmc.org)
  • For candidate gene prioritization

39
  • ToppGene web server (http://toppgene.cchmc.org)
  • For candidate gene prioritization

40
  • ToppGene web server (http://toppgene.cchmc.org)
  • For candidate gene prioritization

41
  • Example: breast cancer study. Genome-wide
    association study identifies novel breast cancer
    susceptibility loci. Nature. 2007 May 27.

rs id       Location   Gene    Training set    Test set
rs2981582   10q26      FGFR2   15 OMIM genes   83 genes in the region

Prioritization result
Rank   Gene    Description                                       P-value
1      BUB3    budding uninhibited by benzimidazoles 3 homolog   0.003865
2      FGFR2   fibroblast growth factor receptor 2               0.018906
3      BCCIP   BRCA2 and CDKN1A interacting protein              0.04784
42
Example: breast cancer study. Genome-wide
association study identifies novel breast cancer
susceptibility loci. Nature. 2007 May 27.
43
ToppGene Prioritization
  • Example: breast cancer

Training set    Test set
15 OMIM genes   342 interacting genes

Ranked interactants
Rank   Gene         Description
1      ATR          ataxia telangiectasia and Rad3 related
2      FANCD2       Fanconi anemia, complementation group D2
3      NBN (NBS1)   Nibrin
44
Limitations
  • General limitations of any training-test
    strategy
  • Prior knowledge of disease-gene associations.
  • Assumption that the disease genes yet to be
    discovered will be consistent with what is
    already known about a disease.
  • Dependence on the accuracy and completeness of the
    functional annotations.
  • Only one-fifth of the known human genes have
    pathway or phenotype annotations, and there are
    still more than 40% of genes whose functions are
    not defined!

Chen et al., 2007 BMC Bioinformatics
45
Mouse Phenotype - Limitations
  1. MP is not a disease-centric ontology, and the
    phenotype of the same gene mutation can vary
    depending on specific mouse strains or their
    genetic backgrounds.
  2. Orthologous genes need not necessarily result in
    orthologous phenotypes.

Possible Solutions - Future Directions: More
efficient cross-species phenome extrapolation,
wherein the mouse phenotype terms are mapped
semantically to human phenotype concepts (from
UMLS) to give orthologous phenotypes, and the
resultant orthologous genes associated with an
orthologous phenotype are identified.
Chen et al., 2007 BMC Bioinformatics
46
PPIs for disease gene identification: Limitations
  • Noisy interactome data
  • In vitro vs. in vivo (e.g., only 5.8% of yeast
    two-hybrid predicted interactions were confirmed
    by HPRD)
  • Extrapolation of interactions from one species to
    another
  • Bias towards well-studied genes/proteins
  • Too many interactants! Hub proteins
  • Two interacting proteins need not lead to similar
    phenotype when mutated
  • Disease proteins may lie at different points in a
    pathway and need not interact directly
  • Lastly, disease mutations need not always involve
    proteins

Oti et al., 2006 J Med Genet
47
http://anil.cchmc.org (under presentations)
http://sbw.kgi.edu/