Bioinformatics Tools for Biomarkers Discovery - PowerPoint PPT Presentation

Title: Bioinformatics Tools for Biomarkers Discovery


1
Bioinformatics Tools for Biomarkers Discovery
  • Stephen GRANITE, Aik Choon TAN
  • Prof. Raimond L. Winslow rwinslow_at_jhu.edu,
    Director, CCBM,
  • Prof. Donald Geman geman_at_jhu.edu,
  • Prof. Daniel Naiman daniel.naiman_at_jhu.edu,
  • Lei Xu leixu_at_jhu.edu,
  • Troy Anderson troy_anderson_at_jhu.edu
  • The Institute for Computational Medicine and
  • Center for Cardiovascular Bioinformatics and
    Modeling (CCBM),
  • Johns Hopkins University

IBM/CCBM Post-Doc Research Fellow: actan_at_jhu.edu
Director, Software/Database Development: sgranite_at_jhu.edu
2
Biomarkers Discovery Workflow
[Figure: workflow diagram. Stages: Patients, Sample Collection, Experiments, Machine Learning (Relative Expression Reversal Classifiers), Decision Rules, Candidate Biomarkers, Follow-up Study, Clinical Applications. Transcriptomics Pipeline: Gene Expression Profiling, with Store/Query against MAGE-DB2. Proteomics Pipeline: Mass Spectrometry and Difference Gel Electrophoresis, with Store/Query against PROTEIN-DB2. Legend: Available at CCBM.]
3
Outline
  • Multi-scale biomedical data repositories
  • System Architecture
  • Relative Expression Reversal Classifiers
  • TSP and k-TSP classifiers
  • Microarray gene expression data
  • Results on binary and multi-class disease
    classification problems
  • Data Integration and Cross-platform analysis
  • Difference Gel Electrophoresis (DIGE) Proteomics
    data
  • Results on disease classification
  • Conclusions

4
Multi-scale Biomedical Repositories
  • The MAGE-DB2 Project is developing a full
    relational mapping of the MicroArray Gene
    Expression (MAGE) object model (OM), optimized to
    run on IBM's scalable, parallel database DB2.
  • The PROTEIN-DB2 Project is developing an
    open-source relational implementation of the
    Proteomics Standards Initiative (PSI) object model
    for storing complete descriptions of a range of
    proteomic experimental data and analyses.

(Granite et al)
5
PROTEIN-DB2 Primary Data / Analysis Storage
  • Two-dimensional Gel Electrophoresis
    Images/Analyses
      • 2D-PAGE / Nonlinear Dynamics Progenesis Analysis
      • DIGE / GE Amersham DeCyder Analysis
  • Two-dimensional Liquid Chromatography
      • Beckman-Coulter PF2D primary data
  • Protein Array
      • Beckman-Coulter A2 primary data
  • Mass Spectrometry (MS) primary data / mzXML
    translation
      • Applied Biosystems Voyager
      • ABI/SCIEX QStar
      • Shimadzu Axima
      • ThermoFinnigan LCQ and LTQ
  • MS Search Results
      • Matrix Science Mascot HTML and XML output

(Granite et al)
6
MAGE-DB2/PROTEIN-DB2 Architecture
(Granite et al)
7
MAGE-DB2/PROTEIN-DB2 Webpages
http://proteomics.jhu.edu/dl/pathidb.php
http://lpar4.wbmei.jhu.edu/wps/portal
(Granite et al)
8
Relative Expression Reversal Classifiers
  • Pairwise rank-based comparisons (relative
    expression values within each array)
  • Generates accurate and simple decision rules
  • TSP classifier: Top Scoring Pair
  • k-TSP classifier: k disjoint Top Scoring Pairs
  • Data-driven, parameter-free learning algorithm
  • Performance is comparable to, or exceeds, that of
    other machine learning methods
  • Easy to interpret, facilitating follow-up study
    (small number of genes)

(Tan et al., 2005, Bioinformatics, 21:3896-3904)
9
Rank-based Classification
  • Novelty: Replace the measured expression values
    by their ranks within profiles, hence obtaining
    invariance to normalization.
  • Example: Differentiate between classes by finding
    pairs of genes whose ordering typically changes
    from Normal to Disease.
  • Simple interpretation: Inversion of mRNA
    abundance.

(From D. Geman)
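
The invariance claim on this slide can be illustrated with a small sketch. It assumes that normalization acts as a strictly increasing transform applied to a whole array (here a log transform and a gain-plus-offset rescaling); the intensity values and the helper name `ranks` are hypothetical and are not taken from the presentation. Ties are not handled.

```python
import math

def ranks(profile):
    """Rank each value within one profile (1 = lowest expression)."""
    order = sorted(range(len(profile)), key=lambda g: profile[g])
    result = [0] * len(profile)
    for rank, gene in enumerate(order, start=1):
        result[gene] = rank
    return result

# Hypothetical intensities for four genes measured on one array.
raw = [120.0, 15.5, 980.0, 47.2]
log_scaled = [math.log2(v) for v in raw]   # log transform
rescaled = [2.5 * v + 10.0 for v in raw]   # array-wide gain and offset

print(ranks(raw))         # [3, 1, 4, 2]
print(ranks(log_scaled))  # [3, 1, 4, 2]
print(ranks(rescaled))    # [3, 1, 4, 2]
```

Because every pairwise ordering within the profile is unchanged, any rule of the form "gene i is ranked below gene j" gives the same answer before and after such a transform.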
10
TSP Classifier
  • For each pair of genes $(i, j)$, $i \neq j$,
    $1 \le i, j \le G$, compute
  • $p_{ij}(\text{Normal}) = \Pr(R_i < R_j \mid \text{Normal})$
  • $p_{ij}(\text{Disease}) = \Pr(R_i < R_j \mid \text{Disease})$
  • $\Delta_{ij} = |p_{ij}(\text{Normal}) - p_{ij}(\text{Disease})|$
  • Select only the top scoring pairs
  • $\{(i, j) : \Delta_{ij} = \Delta_{\max}\}$
  • TSP classifier ($h_{TSP}$) is based on these pairs
  • Example: Let all the top scoring pairs vote
    (Geman et al, 2004)
  • Example: Select one unique top scoring pair,
    based on maximizing difference in ranks (i, j)
    (Tan et al, 2005)
  • Prediction: Suppose $p_{ij}(\text{Normal}) > p_{ij}(\text{Disease})$
    and $x_{new}$ is a new profile. Then
    $h_{TSP}(x_{new}) = \text{Normal}$ if $R_{i,new} < R_{j,new}$,
    and $\text{Disease}$ otherwise. (1)
  • If, on the other hand, $p_{ij}(\text{Disease}) >
    p_{ij}(\text{Normal})$, then the decision rule is reversed.

(Tan et al., 2005, Bioinformatics, 21:3896-3904)
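
The scoring and prediction steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy profiles, and the class labels are assumptions, and ties between equally scoring pairs are broken by taking the first pair found rather than by the rank-difference criterion mentioned above. Raw values are compared directly, which is equivalent to comparing within-profile ranks.

```python
from itertools import combinations

def train_tsp(profiles, labels, class_a="Normal", class_b="Disease"):
    """Pick the gene pair (i, j) maximizing
    |P(R_i < R_j | class_a) - P(R_i < R_j | class_b)|."""
    groups = {c: [x for x, y in zip(profiles, labels) if y == c]
              for c in (class_a, class_b)}
    best = None
    for i, j in combinations(range(len(profiles[0])), 2):
        # Fraction of profiles in each class where gene i is below gene j.
        p_a = sum(x[i] < x[j] for x in groups[class_a]) / len(groups[class_a])
        p_b = sum(x[i] < x[j] for x in groups[class_b]) / len(groups[class_b])
        delta = abs(p_a - p_b)
        if best is None or delta > best["score"]:
            # The class in which "i below j" is more frequent gets that vote.
            low, high = (class_a, class_b) if p_a > p_b else (class_b, class_a)
            best = {"pair": (i, j), "score": delta,
                    "if_i_below_j": low, "otherwise": high}
    return best

def predict_tsp(rule, profile):
    """Apply the single-pair decision rule to a new profile."""
    i, j = rule["pair"]
    return rule["if_i_below_j"] if profile[i] < profile[j] else rule["otherwise"]

# Toy data: 3 genes, 3 profiles per class (hypothetical values).
profiles = [[1, 5, 3], [2, 6, 1], [1, 4, 2],   # Normal
            [5, 1, 3], [6, 2, 1], [4, 1, 2]]   # Disease
labels = ["Normal"] * 3 + ["Disease"] * 3

rule = train_tsp(profiles, labels)
print(rule["pair"], rule["score"])        # (0, 1) 1.0
print(predict_tsp(rule, [3, 9, 4]))       # Normal
print(predict_tsp(rule, [8, 2, 5]))       # Disease
```

The learned rule reads "IF gene 0 < gene 1 THEN Normal ELSE Disease", which mirrors the two-gene form of the rules shown later in the deck.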
11
k-TSP Classifier
  • Uses exactly k top disjoint pairs in prediction.
  • k is determined by internal cross-validation
  • Ensemble learning to combine the discriminating
    power of many weaker rules to make more
    reliable predictions.
  • Prediction
  • Suppose $x_{new}$ is a new profile; each gene pair
    $(i_u, j_u)$, $u = 1, \ldots, k$, votes according to (1).
  • The k-TSP classifier $h_{k\text{-TSP}}$ employs an
    unweighted majority voting procedure to obtain the
    final prediction of $y_{new}$.

(Tan et al., 2005, Bioinformatics, 21:3896-3904)
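
A minimal sketch of the disjoint-pair selection and majority vote follows. It is an illustration under stated assumptions: `k` is passed in directly instead of being chosen by internal cross-validation, pairs are taken greedily in score order, tied scores keep their enumeration order, and all names and toy values are hypothetical. An odd `k` avoids tied votes.

```python
from collections import Counter
from itertools import combinations

def train_ktsp(profiles, labels, k, class_a="Normal", class_b="Disease"):
    """Greedily keep the k best-scoring gene pairs that share no gene."""
    groups = {c: [x for x, y in zip(profiles, labels) if y == c]
              for c in (class_a, class_b)}
    scored = []
    for i, j in combinations(range(len(profiles[0])), 2):
        p_a = sum(x[i] < x[j] for x in groups[class_a]) / len(groups[class_a])
        p_b = sum(x[i] < x[j] for x in groups[class_b]) / len(groups[class_b])
        low, high = (class_a, class_b) if p_a > p_b else (class_b, class_a)
        scored.append((abs(p_a - p_b), i, j, low, high))
    scored.sort(key=lambda s: s[0], reverse=True)  # stable: ties keep order

    rules, used = [], set()
    for _, i, j, low, high in scored:
        if i in used or j in used:
            continue  # enforce disjointness: each gene appears at most once
        rules.append((i, j, low, high))
        used.update((i, j))
        if len(rules) == k:
            break
    return rules

def predict_ktsp(rules, profile):
    """Unweighted majority vote over the k pairwise rules."""
    votes = Counter(low if profile[i] < profile[j] else high
                    for i, j, low, high in rules)
    return votes.most_common(1)[0][0]

# Toy data: 6 genes, 2 profiles per class (hypothetical values).
profiles = [[1, 7, 2, 8, 3, 9], [1.5, 6, 2.5, 7, 3.5, 8],   # Normal
            [7, 1, 8, 2, 9, 3], [6, 1.5, 7, 2.5, 8, 3.5]]   # Disease
labels = ["Normal", "Normal", "Disease", "Disease"]

rules = train_ktsp(profiles, labels, k=3)
print([(i, j) for i, j, _, _ in rules])        # [(0, 1), (2, 3), (4, 5)]
print(predict_ktsp(rules, [2, 9, 8, 3, 1, 5])) # Normal (2 of 3 votes)
print(predict_ktsp(rules, [9, 2, 8, 3, 1, 5])) # Disease (2 of 3 votes)
```

The second and third calls show the ensemble effect: one pair disagrees with the others in each case, and the majority still decides the label.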
12
Microarray Data Sets
(Binary class Problems)
(Multi-class Problems)
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
13
Results (LOOCV, Binary Class Problems)
Number of Informative Genes
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
14
Results (Test Accuracy for Multi-Class Problems)
Number of Informative Genes
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
15
Decision rules for the two classes ALL and AML, with the score Δ of each pair
(a) TSP
IF SPTAN1 ≥ CD33 THEN ALL ELSE AML (Δ = 0.9787)
(b) k-TSP
IF SPTAN1 ≥ CD33 THEN ALL ELSE AML (Δ = 0.9787)
IF HA-1 ≥ ZYX THEN ALL ELSE AML (Δ = 0.9787)
IF TCF3 > APLP2 THEN ALL ELSE AML (Δ = 0.9574)
IF ATP2A3 ≥ CST3 THEN ALL ELSE AML (Δ = 0.9387)
IF DGKD > MGST1 THEN ALL ELSE AML (Δ = 0.9387)
IF CCND3 ≥ NPC2 THEN ALL ELSE AML (Δ = 0.9387)
IF TOP2B > PLCB2 THEN ALL ELSE AML (Δ = 0.9387)
IF Macmarcks ≥ CTSD THEN ALL ELSE AML (Δ = 0.9362)
IF PSMB8 ≥ DF THEN ALL ELSE AML (Δ = 0.9200)
Genes previously identified by Golub et al
(1999)
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
16
Direct Data Integration
[Figure: data sets from Lab A, Lab B, Lab C, Lab X, and Lab Y combined directly]
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
17
Data Sets
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
18
TSPs from Data Integration
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
19
Results on Test Set
Comparisons of Marker TSP with Individual TSPs
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
20
Marker TSP for Prostate Cancer
  • HPN (Hepsin): biomarker candidate for prostate
    cancer
  • STAT6 (Signal transducer and activator of
    transcription 6)

IF HPN > STAT6 THEN Prostate Cancer ELSE Normal
PSA (Prostate Specific Antigen): Sn = 67.5-80%, Sp = 60-70%
TSP (HPN, STAT6): Sn = 91.7%, Sp = 97.7% (from this study)
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
21
DIGE Technology
(From http://www5.amershambiosciences.com)
Proteomics Data
Experimental Settings
  • Gels: 18 experiments
  • Cy2: Internal Standards (18)
  • Cy3: Cancer gels (18)
  • Cy5: Normal gels (18)
  • 1098 protein spots (BVA ratios from DeCyder
    software)
(Troy Anderson et al)
22
Decision Rule
Decision Rule: IF Ratio530 ? Ratio786 THEN Cancer, ELSE Normal.
LOOCV Results:
  • Accuracy: 97.2% (35/36)
  • Sensitivity: 100% (18/18)
  • Specificity: 94.4% (17/18)
(Troy Anderson et al)
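
The reported percentages follow directly from the counts on the slide. The short sketch below reproduces the arithmetic, assuming the 18 cancer gels are the positive class, so that 18/18 corresponds to 18 true positives and 17/18 to 17 true negatives with one false positive; the function name is hypothetical.

```python
def summarize(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Counts implied by the LOOCV results: 18 cancer and 18 normal samples.
metrics = summarize(tp=18, fn=0, tn=17, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# accuracy: 97.2%
# sensitivity: 100.0%
# specificity: 94.4%
```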
23
Protein Marker Spots
(Troy Anderson et al)
24
http://www.ccbm.jhu.edu
25
Conclusions
  • Bioinformatics tools to facilitate biomarkers
    discovery
  • k-TSP is comparable with the state-of-the-art
    classifiers (PAM, SVM) in classifying gene
    expression profiles
  • k-TSP generates simple and accurate decision
    rules
  • Biological significance
  • Easy to interpret
  • Potential clinical applications
  • Allow direct data integration without
    performing normalization
  • Allow cross-platform analysis
  • Applicable to a wide range of high-throughput
    data