Data Mining in Genomics: the dawn of personalized medicine



1
Data Mining in Genomics: the dawn of personalized medicine
  • Gregory Piatetsky-Shapiro
  • KDnuggets
  • www.KDnuggets.com/gps.html
  • Connecticut College, October 15, 2003

2
Overview
  • Data Mining and Knowledge Discovery
  • Genomics and Microarrays
  • Microarray Data Mining

3
Trends leading to Data Flood
  • More data is generated
    • Bank, telecom, other business transactions ...
    • Scientific data: astronomy, biology, etc.
    • Web, text, and e-commerce
  • More data is captured
    • Storage technology is faster and cheaper
    • DBMS capable of handling bigger DB

4
Knowledge Discovery Process
[Diagram labels: Raw Data, Integration, Data Warehouse, Selection and
Cleaning, Target Data, Transformation, Transformed Data, Data Mining,
Interpretation and Evaluation, Knowledge, Understanding]
5
Major Data Mining Tasks
  • Classification: predicting an item class
  • Clustering: finding clusters in data
  • Associations: e.g. A, B, C occur frequently
  • Visualization: to facilitate human discovery
  • Summarization: describing a group
  • Estimation: predicting a continuous value
  • Deviation Detection: finding changes
  • Link Analysis: finding relationships

6
Major Application Areas for Data Mining Solutions
  • Advertising
  • Bioinformatics
  • Customer Relationship Management (CRM)
  • Database Marketing
  • Fraud Detection
  • eCommerce
  • Health Care
  • Investment/Securities
  • Manufacturing, Process Control
  • Sports and Entertainment
  • Telecommunications
  • Web

7
Genome, DNA, and Gene Expression
  • An organism's genome is the program for making
    the organism, encoded in DNA
  • Human DNA has about 30,000-35,000 genes
  • A gene is a segment of DNA that specifies how to
    make a protein
  • Cells are different because of differential gene
    expression
  • About 40% of human genes are expressed at one
    time
  • Microarray devices measure gene expression

8
Molecular Biology Overview
[Diagram labels: Cell, Nucleus, Chromosome, Gene (DNA),
Gene (mRNA), single strand, Protein, Gene expression]
Graphics courtesy of the National Human Genome
Research Institute
9
Affymetrix Microarrays
  • Chip dimension: 1.28 cm
  • About 10^7 oligonucleotides: half Perfectly Match mRNA
    (PM), half have one Mismatch (MM)
  • Gene expression is computed from PM and MM
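The slide does not say how expression is computed from PM and MM. A minimal sketch, assuming the simple average-difference scheme (the mean of PM minus MM over a gene's probe pairs); the function name and the intensities are hypothetical:

```python
def average_difference(probe_pairs):
    """Mean of (PM - MM) over a gene's probe pairs.

    probe_pairs: list of (pm_intensity, mm_intensity) tuples.
    Assumed scheme for illustration only; production pipelines
    add background correction and outlier handling.
    """
    diffs = [pm - mm for pm, mm in probe_pairs]
    return sum(diffs) / len(diffs)


# Hypothetical intensities for one gene with two probe pairs.
expression = average_difference([(120, 100), (80, 150)])
print(expression)
```

Under this scheme the value goes negative whenever the mismatch probes are brighter than the perfect-match probes, so negative expression values are possible.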
10
Affymetrix Microarray Raw Image
Gene Value
D26528_at 193
D26561_cds1_at -70
D26561_cds2_at 144
D26561_cds3_at 33
D26579_at 318
D26598_at 1764
D26599_at 1537
D26600_at 1204
D28114_at 707
[Diagram labels: Scanner, enlarged section of raw image, raw data]
11
Microarray Potential Applications
  • New and better molecular diagnostics
  • New molecular targets for therapy
    • few new drugs, large pipeline
  • Outcome depends on genetic signature
    • best treatment?
  • Fundamental Biological Discovery
    • finding and refining biological pathways
  • Personalized medicine ?!

12
Microarray Data Mining Challenges
  • Avoiding false positives, due to
    • too few records (samples), usually < 100
    • too many columns (genes), usually > 1,000
  • Model needs to be robust in presence of noise
  • For reliability, need large gene sets; for
    diagnostics or drug targets, need small gene sets
  • Estimate class probability
  • Model needs to be explainable to biologists
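
The false-positive problem above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the talk's method: every "gene" is pure Gaussian noise, the score is a Welch t statistic, and the threshold of 2.1 is an illustrative cutoff near the two-sided 5% level for 10 samples per class.

```python
import random


def t_value(a, b):
    """Welch t statistic for the difference in means of two samples."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5


def count_spurious_genes(n_genes, n_per_class, threshold, seed=0):
    """Count noise-only genes whose |t| between two classes exceeds threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        class1 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        class2 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        if abs(t_value(class1, class2)) > threshold:
            hits += 1
    return hits


# Few samples (20) and many genes (2000), none of which carries any signal.
spurious = count_spurious_genes(n_genes=2000, n_per_class=10, threshold=2.1)
print(f"{spurious} of 2000 noise genes pass the threshold")
```

Even though no gene differs between the classes, a sizeable number pass the cutoff by chance, which is why gene selection needs a guard against false positives.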

13
False Positives in Astronomy
cartoon used with permission
14
CATs: Clementine Application Templates
  • CATs - examples of complete data mining processes
  • Microarray CAT

[Diagram labels: Preparation, Multi-Class, Clustering, 2-Class]
15
Key Ideas
  • Capture the complete process
  • Cross-validation loop with feature selection inside
  • Randomization to select significant genes
  • Internal iterative feature selection loop
  • For each class, separate selection of optimal
    gene sets
  • Neural nets: robust in presence of noise
  • Bagging of neural nets
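
A minimal sketch of a cross-validation loop with feature selection inside it. The assumptions are loud ones: a nearest-centroid classifier stands in for the neural nets, genes are ranked by absolute Welch t-value, and the data are synthetic with one planted informative gene. All function names are hypothetical.

```python
import random


def t_value(a, b):
    """Welch t statistic for the difference in means of two samples."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5


def top_genes(X, y, k):
    """Indices of the k genes with the largest |t|, using only the rows given."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append((abs(t_value(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def centroid_predict(X_train, y_train, genes, sample):
    """Nearest-centroid prediction on the selected genes (stand-in classifier)."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        dist = sum(
            (sample[g] - sum(r[g] for r in rows) / len(rows)) ** 2 for g in genes
        )
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(X, y, k_genes, n_folds=5):
    """Cross-validated accuracy; gene selection is redone inside every fold."""
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        X_tr = [X[i] for i in range(len(X)) if i not in test_idx]
        y_tr = [y[i] for i in range(len(X)) if i not in test_idx]
        genes = top_genes(X_tr, y_tr, k_genes)  # sees training folds only
        for i in test_idx:
            correct += centroid_predict(X_tr, y_tr, genes, X[i]) == y[i]
    return correct / len(X)


# Synthetic data: 20 samples, 30 noise genes, gene 0 shifted for class 1.
rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in y]
for row, label in zip(X, y):
    row[0] += 8.0 * label
accuracy = cv_accuracy(X, y, k_genes=2)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Selecting genes on all samples before splitting would let the test folds influence the gene list and understate the error, which is the reason for keeping selection inside the loop.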

16
Microarray Classification
[Diagram labels: Data, Train data, Test data, Feature and Parameter
Selection, Model Building, Evaluation]
17
Classification: External Cross-Validation
[Diagram labels: Gene Data, Train, Final Test, Data, Train data,
Test data, Feature and Parameter Selection, Model Building,
Evaluation, Final Model, Final Results]
18
Measuring false positives with randomization
Gene 178 105 4174 7133
Class 1 1 2 2
Rand Class 2 1 1 2
  • Randomize 500 times
  • Bottom 1% T-value: -2.08
  • Select potentially interesting genes at 1%
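The randomization idea on this slide can be sketched as follows: shuffle the class labels many times, collect the t-values that arise by chance, and read off the bottom 1% cutoff. This is a sketch, not the talk's Clementine stream; the data are synthetic noise, the statistic is a Welch t-value, and the function names are hypothetical.

```python
import random


def t_value(a, b):
    """Welch t statistic for the difference in means of two samples."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5


def null_t_values(X, y, n_rand=500, seed=0):
    """Sorted t-values of every gene under n_rand random relabelings."""
    rng = random.Random(seed)
    labels = list(y)
    null = []
    for _ in range(n_rand):
        rng.shuffle(labels)  # "Rand Class": same class sizes, random assignment
        for g in range(len(X[0])):
            a = [row[g] for row, l in zip(X, labels) if l == 1]
            b = [row[g] for row, l in zip(X, labels) if l == 2]
            null.append(t_value(a, b))
    return sorted(null)


def bottom_quantile(sorted_values, fraction):
    """Value below which roughly `fraction` of the sorted values fall."""
    return sorted_values[int(fraction * len(sorted_values))]


# Synthetic example: 20 samples (10 per class), 20 genes of pure noise.
rng = random.Random(0)
y = [1] * 10 + [2] * 10
X = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in y]
null = null_t_values(X, y)
threshold = bottom_quantile(null, 0.01)
print(f"bottom 1% t-value under randomization: {threshold:.2f}")
```

Genes whose real t-value falls below this cutoff (or above the matching top cutoff) are the "potentially interesting" ones; the cutoff depends on the data, so it will not reproduce the -2.08 reported on the slide.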
19
Gene Reduction improves Classification
  • Most learning algorithms look for non-linear
    combinations of features -- they can easily find many
    spurious combinations given a small number of records
    and a large number of genes
  • Classification accuracy improves if we first
    reduce the number of genes by a linear method, e.g.
    T-values of mean difference
  • Heuristic: select an equal number of genes from
    each class
  • Then apply a favorite machine learning algorithm
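
A minimal sketch of the heuristic above for a two-class problem: rank genes by t-value and take the same number from each end of the ranking. It assumes a Welch t-value and that a positive t means "higher in class 1"; the tiny data set and the function name are hypothetical.

```python
def t_value(a, b):
    """Welch t statistic for the difference in means of two samples."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5


def select_per_class(X, y, k):
    """Return (genes high in class 1, genes high in class 2), k of each."""
    scored = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 1]
        b = [row[g] for row, label in zip(X, y) if label == 2]
        scored.append((t_value(a, b), g))
    ranked = sorted(scored)
    high_in_class2 = [g for _, g in ranked[:k]]        # most negative t
    high_in_class1 = [g for _, g in ranked[::-1][:k]]  # most positive t
    return high_in_class1, high_in_class2


# Hypothetical data: rows are samples, columns are genes.
X = [
    [10.0, 1.0, 5.0, 5.1],
    [11.0, 2.0, 5.2, 5.0],
    [9.0, 1.5, 4.9, 5.2],
    [1.0, 10.0, 5.1, 5.1],
    [2.0, 11.0, 5.0, 4.9],
    [1.5, 9.0, 5.05, 5.15],
]
y = [1, 1, 1, 2, 2, 2]
for_class1, for_class2 = select_per_class(X, y, k=1)
print(for_class1, for_class2)
```

Taking genes from both ends keeps one class from dominating the selected set when its markers happen to have larger t-values.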

20
Iterative Wrapper approach to selecting the best gene set
  • Test models using 1, 2, 3, ..., 10, 20, 30, 40, ...,
    100 top genes with cross-validation.
  • Heuristic 1: evaluate errors from each class;
    select the number of genes from each class that
    minimizes error for that class
  • For randomized algorithms, average 10
    cross-validation runs!
  • Select gene set with lowest average error
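
The last two steps (average several cross-validation runs, then pick the gene count with the lowest average error) can be sketched as below. The error rates are made-up placeholders, not results from the talk, and the function name is hypothetical.

```python
import statistics


def best_gene_count(errors_by_count):
    """Pick the gene-set size with the lowest mean error across CV runs.

    errors_by_count: {number_of_genes: [error rate from each CV run]}.
    Returns (best size, {size: (mean error, sample standard deviation)}).
    """
    summary = {
        n: (statistics.mean(errs), statistics.stdev(errs))
        for n, errs in errors_by_count.items()
    }
    best = min(summary, key=lambda n: summary[n][0])
    return best, summary


# Placeholder error rates from three hypothetical cross-validation runs each.
runs = {
    1: [0.20, 0.25, 0.15],
    10: [0.02, 0.00, 0.04],
    40: [0.05, 0.10, 0.05],
}
best_n, summary = best_gene_count(runs)
for n, (mean_err, sd) in summary.items():
    print(f"{n:3d} genes: mean error {mean_err:.3f} +/- {sd:.3f}")
print("selected gene count:", best_n)
```

The standard deviation is kept alongside the mean because a single cross-validation run of a randomized learner can be misleading on its own.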

21
Clementine stream for subset selection by cross-validation
22
Microarrays: ALL/AML Example
  • Leukemia: Acute Lymphoblastic (ALL) vs Acute
    Myeloid (AML), Golub et al., Science, v. 286, 1999
  • 72 examples (38 train, 34 test), about 7,000
    genes
  • well-studied (CAMDA-2000), good test example

[Images: ALL, AML]
Visually similar, but genetically very different
23
Gene subset selection: one cross-validation
[Chart: single cross-validation run]
24
Gene subset selection: multiple cross-validation runs
  • For ALL/AML data, 10 genes per class had the
    lowest error (< 1%)
  • Point in the center is the average error from 10
    cross-validation runs
  • Bars indicate 1 st. dev. above and below
25
ALL/AML Results on the test data
  • Genes selected and model trained on Train set
    ONLY!
  • Best Net with 10 top genes per class (20 overall)
    was applied to the test data (34 samples)
  • 33 correct predictions (97% accuracy)
  • 1 error on sample 66
  • Actual class AML, Net prediction ALL
  • Other methods consistently misclassify sample 66
    -- misclassified by a pathologist?

26
Pediatric Brain Tumour Data
  • 92 samples, 5 classes (MED, EPD, JPA, MGL,
    RHB) from U. of Chicago Children's Hospital
  • Outer cross-validation with gene selection inside
    the loop
  • Ranking by absolute T-test value (selects top
    positive and negative genes)
  • Select best genes by adjusted error for each
    class
  • Bagging of 100 neural nets
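
A minimal sketch of bagging: train each model on a bootstrap replicate of the training set and combine predictions by majority vote. A 1-nearest-neighbour learner stands in for the neural nets, and treating the winning vote share as the bag's confidence is an assumption, not something the slides specify. The toy data and function names are hypothetical.

```python
import random
from collections import Counter


def train_1nn(X, y):
    """Toy base learner (1-nearest neighbour), a stand-in for a neural net."""
    def predict(sample):
        dists = [sum((a - b) ** 2 for a, b in zip(row, sample)) for row in X]
        return y[dists.index(min(dists))]
    return predict


def train_bag(train_fn, X, y, n_models=100, seed=0):
    """Train n_models learners, each on a bootstrap replicate of (X, y)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # with replacement
        models.append(train_fn([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bag_predict(models, sample):
    """Majority vote; confidence is the share of models backing the winner."""
    votes = Counter(model(sample) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


# Hypothetical one-gene data set with two tumour classes.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = ["MED", "MED", "MED", "RHB", "RHB", "RHB"]
models = train_bag(train_1nn, X, y)
label, confidence = bag_predict(models, [0.05])
print(label, confidence)
```

Averaging over replicates smooths out the variance of any single model, which matters most when the training set is as small as it is here.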

27
Selecting Best Gene Set
  • Minimizing Combined Error for all classes is not
    optimal

[Chart: average, high, and low error rate for all classes]
28
Error rates for each class
[Chart: error rate vs. genes per class]
29
Evaluating One Network
Averaged over 100 Networks
Class Error rate (%)
MED 2.1
MGL 17
RHB 24
EPD 9
JPA 19
All 8.3
30
Bagging 100 Networks
Class  Individual Error Rate (%)  Bag Error Rate (%)  Bag Avg Conf (%)
MED    2.1                        2 (0)               98
MGL    17                         10                  83
RHB    24                         11                  76
EPD    9                          0                   91
JPA    19                         0                   81
All    8.3                        3 (2)               92
  • Note: suspected error on one sample (labeled as
    MED but consistently classified as RHB)

31
AF1q: New Marker for Medulloblastoma?
  • AF1q: ALL1-fused gene from chromosome 1q
  • transmembrane protein
  • Related to leukemia (3 PubMed entries) but not to
    Medulloblastoma

32
Future directions for Microarray Analysis
  • Algorithms optimized for small samples
  • Integration with other data
    • biological networks
    • medical text
    • protein data
  • Cost-sensitive classification algorithms
    • error cost depends on outcome (don't want to miss
      treatable cancer), treatment side effects, etc.
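
A minimal sketch of a cost-sensitive decision: given estimated class probabilities and a cost for each (prediction, true class) pair, choose the prediction with the lowest expected cost. The probabilities, the cost values, and the function name are all hypothetical.

```python
def min_cost_decision(probs, cost):
    """Choose the prediction with the lowest expected cost.

    probs: {true_class: estimated probability}.
    cost:  cost[predicted_class][true_class], 0 on the diagonal.
    Returns (chosen prediction, {prediction: expected cost}).
    """
    expected = {
        pred: sum(probs[true] * cost[pred][true] for true in probs)
        for pred in cost
    }
    return min(expected, key=expected.get), expected


# Hypothetical costs: missing a treatable cancer is ten times worse
# than an unnecessary follow-up.
probs = {"cancer": 0.2, "healthy": 0.8}
cost = {
    "cancer": {"cancer": 0.0, "healthy": 1.0},
    "healthy": {"cancer": 10.0, "healthy": 0.0},
}
decision, expected = min_cost_decision(probs, cost)
print(decision, expected)
```

With these numbers the less likely class is still the cheaper call, which is the behaviour a plain most-probable-class rule cannot give.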

33
Acknowledgements
  • Eric Bremer, Children's Hospital (Chicago),
    Northwestern U.
  • Greg Cooper, U. Pittsburgh
  • Tom Khabaza, SPSS
  • Sridhar Ramaswamy, MIT/Whitehead Institute
  • Pablo Tamayo, MIT/Whitehead Institute

34
Thank you
  • Further resources on Data Mining:
    www.KDnuggets.com
  • Microarrays:
    www.KDnuggets.com/websites/microarray.html
  • Contact: Gregory Piatetsky-Shapiro,
    www.kdnuggets.com/gps.html