Machine Learning for High-Throughput Biological Data
Transcript and Presenter's Notes
1
Machine Learning for High-Throughput Biological
Data
These notes were originally from KDD 2006 tutorial
notes, written by David Page of the Dept. of
Biostatistics and Medical Informatics and the Dept. of
Computer Sciences, University of
Wisconsin-Madison. http://www.biostat.wisc.edu/page/PageKDD2006.ppt
2
Some Data Types We'll Discuss
  • Gene expression microarrays
  • Single-nucleotide polymorphisms (SNPs)
  • Mass spectrometry proteomics and metabolomics
  • Protein-protein interactions (from
    co-immunoprecipitation)
  • High-throughput screening of potential drug
    molecules

3
Image from the DOE Human Genome
Program: http://www.ornl.gov/hgmis
4
How Microarrays Work
Probes (DNA)
Labeled Sample (RNA)
Hybridization
Gene Chip Surface
5
Two Views of Microarray Data
  • Data points are genes
  • Represented by expression levels across different
    samples (i.e., features = samples)
  • Goal: categorize new genes
  • Data points are samples (e.g., patients)
  • Represented by expression levels of different
    genes (i.e., features = genes)
  • Goal: categorize new samples

6
Two Ways to View The Data
7
Data Points are Genes
8
Data Points are Samples
9
Supervision: Add Class Values
10
Supervised Learning Task
  • Given: a set of microarray experiments, each done
    with mRNA from a different patient (same cell
    type from every patient). Patients' expression
    values for each gene constitute the features, and
    the patient's disease constitutes the class
  • Do: learn a model that accurately predicts
    class based on features (a sketch follows below)
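As an aside (not part of the original tutorial), a minimal Python sketch of this task, assuming scikit-learn and using random numbers in place of real expression data; naive Bayes here loosely stands in for the simple classifiers discussed in later slides.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(0)
n_patients, n_genes = 60, 5000                 # hypothetical sizes
X = rng.normal(size=(n_patients, n_genes))     # features: expression level of each gene
y = rng.integers(0, 2, size=n_patients)        # class: patient's disease status
scores = cross_val_score(GaussianNB(), X, y, cv=5)   # cross-validated accuracy estimate
print("mean accuracy:", scores.mean())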

11
Location in Task Space
12
Leukemia (Golub et al., 1999)
  • Classes: Acute Lymphoblastic Leukemia (ALL) and
    Acute Myeloid Leukemia (AML)
  • Approach: weighted voting (essentially naïve
    Bayes)
  • Cross-validated accuracy: of 34 samples,
    declined to predict 5, correct on the other 29

13
Cancer vs. Normal
  • Relatively easy to predict accurately, because so
    much goes haywire in cancer cells
  • Primary barrier is noise in the data: impure RNA,
    cross-hybridization, etc.
  • Studies include breast, colon, prostate,
    lymphoma, and multiple myeloma

14
X-Val Accuracies for Multiple Myeloma (74 MM vs.
31 Normal)
15
More MM (300), Benign Condition MGUS (Hardin et
al., 2004)
16
ROC Curves: Cancer vs. Normal
17
ROC: Cancer vs. Benign (MGUS)
18
Work by Statisticians Outside of Standard
Classification/Clustering
  • Methods to better convert Affymetrix's low-level
    intensity measurements into expression levels,
    e.g., work by Speed, Wong, Irizarry
  • Methods to find differentially expressed genes
    between two samples, e.g., work by Newton and
    Kendziorski
  • But the following is most related

19
Ranking Genes by Significance
  • Some biologists don't want one predictive model,
    but a rank-ordered list of genes to explore
    further (with estimated significance)
  • For each gene we have a set of expression levels
    under our conditions, say cancer vs. normal
  • We can do a t-test to see if the mean expression
    levels are different under the two conditions,
    yielding a p-value
  • Multiple comparisons problem: if we repeat this
    test for 30,000 genes, some will pop up as
    significant just by chance alone
  • Could do a Bonferroni correction (multiply
    p-values by 30,000), but this is drastic and
    might eliminate all of them (sketch below)
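A hedged sketch of the two points above (per-gene t-tests and the drastic Bonferroni correction), using simulated pure-noise data; the array shapes and thresholds are assumptions, not from the tutorial.

import numpy as np
from scipy import stats
rng = np.random.default_rng(1)
n_genes = 30000
cancer = rng.normal(size=(n_genes, 40))        # hypothetical expression values, rows = genes
normal = rng.normal(size=(n_genes, 40))
t, p = stats.ttest_ind(cancer, normal, axis=1) # one t-test (and p-value) per gene
bonferroni = np.minimum(p * n_genes, 1.0)      # multiply p-values by the number of genes
print((p < 0.05).sum(), "genes 'significant' by chance;", (bonferroni < 0.05).sum(), "after Bonferroni")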

20
False Discovery Rate (FDR) (Storey and
Tibshirani, 2001)
  • Addresses multiple comparisons but is less
    extreme than Bonferroni
  • Replaces the p-value by a q-value: the fraction of genes
    with this p-value or lower that really don't have
    different means in the two classes (false
    discoveries)
  • Publicly available in R as part of the Bioconductor
    package
  • Recommendation: use this in addition to your
    supervised data mining; your collaborators will
    want to see it (a related sketch follows below)
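The q-value method itself is the R/Bioconductor package mentioned above; as a rough stand-in (my sketch, not the Storey-Tibshirani code), here is the related Benjamini-Hochberg FDR adjustment applied to an array of per-gene p-values.

import numpy as np
def bh_adjust(pvals):
    # Benjamini-Hochberg adjusted p-values: p_(i) * n / rank, made monotone and capped at 1
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out
# genes with bh_adjust(p) < 0.05 form a list with an estimated 5% false discovery rate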

21
FDR Highlights Difficulties Getting Insight into
Cancer vs. Normal
22
Using Benign Condition Instead of Normal Helps
Somewhat
23
Question to Anticipate
  • You've run a supervised data mining algorithm on
    your collaborator's data, and you present an
    estimate of accuracy or an ROC curve (from X-val)
  • How did you adjust this for the multiple
    comparisons problem?
  • Answer: you don't need to, because you commit to a
    single predictive model before ever looking at
    the test data for a fold; this is only one
    comparison

24
Prognosis and Treatment
  • Features: same as for diagnosis
  • Rather than disease state, the class value becomes
    life expectancy with a given treatment (or
    positive response vs. no response to a given
    treatment)

25
Breast Cancer Prognosis (Van't Veer et al., 2002)
  • Classes: good prognosis (no metastasis within
    five years of initial diagnosis) vs. poor
    prognosis
  • Algorithm: ensemble of voters
  • Results: 83% cross-validated accuracy on 78
    cases

26
A Lesson
  • Previous work selected features to use in the
    ensemble by looking at the entire data set
  • Should have repeated feature selection on each
    cross-val fold
  • Authors also chose ensemble size by seeing which
    size gave the highest cross-val result
  • Authors corrected this in the web supplement;
    accuracy went from 83% to 73%
  • Remember to tune parameters separately for each
    cross-val fold! (see the sketch below)
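A sketch of the fix described above, assuming scikit-learn: putting feature selection inside a pipeline means it is re-fit on the training portion of every cross-validation fold, never on the held-out fold.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
clf = Pipeline([
    ("select", SelectKBest(f_classif, k=70)),  # feature selection fit on each training fold only
    ("model", GaussianNB()),
])
# scores = cross_val_score(clf, X, y, cv=10)   # X, y as in the earlier microarray sketch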

27
Prognosis with Specific Therapy (Rosenwald et
al., 2002)
  • Data set contains gene-expression patterns for
    160 patients with diffuse large B-cell lymphoma
    receiving anthracycline chemotherapy
  • Class label is five-year survival
  • One test-train split: 80/80
  • True positive rate: 60%; false negative rate:
    39%

28
Some Future Directions
  • Using gene-chip data to select therapy: predict
    which therapy gives the best prognosis for a
    patient
  • Combining gene expression data with clinical data
    such as lab results, medical and family history
  • Multiple relational tables; may benefit from
    relational learning

29
Unsupervised Learning Task
  • Given: a set of microarray experiments under
    different conditions
  • Do: cluster the genes, where a gene is described by
    its expression levels in the different experiments
    (see the sketch below)
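A minimal sketch of this clustering task (my illustration, with made-up sizes): genes are rows, experiments are columns, and k-means groups genes with similar expression profiles.

import numpy as np
from sklearn.cluster import KMeans
rng = np.random.default_rng(2)
expr = rng.normal(size=(6000, 20))             # 6000 genes x 20 experiments (hypothetical)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(expr)
print("genes in cluster 0:", (labels == 0).sum())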

30
Location in Task Space
31
Example (green = up-regulated, red =
down-regulated)
Genes
Experiments (Samples)
32
Visualizing Gene Clusters (e.g., Sharan and
Shamir, 2000)
Gene Cluster 1, size = 20
Gene Cluster 2, size = 43
Time (10-minute intervals)
33
Unsupervised Learning Task 2
  • Given: a set of microarray experiments (samples)
    corresponding to different conditions or
    patients
  • Do: cluster the experiments

34
Location in Task Space
35
Examples
  • Cluster samples from mice subjected to a variety
    of toxic compounds (Thomas et al., 2001)
  • Cluster samples from cancer patients, potentially
    to discover different subtypes of a cancer
  • Cluster samples taken at different time points

36
Some Biological Pathways
  • Regulatory pathways
  • Nodes are labeled by genes
  • Arcs denote influence on transcription
  • G1 codes for P1, P1 inhibits G2's transcription
  • Metabolic pathways
  • Nodes are metabolites, large biomolecules (e.g.,
    sugars, lipids, proteins and modified proteins)
  • Arcs from biochemical reaction inputs to outputs
  • Arcs labeled by enzymes (themselves proteins)

37
Metabolic Pathway Example
(the Krebs Cycle, also called the TCA Cycle or Citric Acid Cycle)
(Figure: the cycle's metabolites (Oxaloacetate, Citrate,
cis-Aconitate, Isocitrate, a-Ketoglutarate, Succinyl-CoA,
Succinate, Fumarate, Malate) connected by reactions labeled
with their enzymes (citrate synthase, aconitase, IDH, a-KDGH,
succinate thiokinase, fumarase, MDH), with cofactors Acetyl CoA,
HSCoA, NAD/NADH, FAD/FADH2, GDP+Pi/GTP, CO2, and H2O.)
38
Regulatory Pathway (KEGG)
39
Using Microarray Data Only
  • Regulatory pathways
  • Nodes are labeled by genes
  • Arcs denote influence on transcription
  • G1 codes for P1, P1 inhibits G2's transcription
  • Metabolic pathways
  • Nodes are metabolites, large biomolecules (e.g.,
    sugars, lipids, proteins, and modified proteins)
  • Arcs from biochemical reaction inputs to outputs
  • Arcs labeled by enzymes (themselves proteins)

40
Supervised Learning Task 2
  • Given: a set of microarray experiments for the same
    organism under different conditions
  • Do: learn a graphical model that accurately
    predicts expression of some genes in terms of
    others

41
Some Approaches to Learning Regulatory Networks
  • Bayes Net Learning (started with Friedman &
    Halpern, 1999; we'll see more)
  • Boolean Networks (Akutsu, Kuhara, Maruyama &
    Miyano, 1998; Ideker, Thorsson & Karp, 2002)
  • Related Graphical Approaches (Tanay & Shamir,
    2001; Chrisman, Langley, Baay & Pohorille, 2003)

42
Bayesian Network (BN)
Note: the direction of an arrow indicates dependence, not
causality
43
Problem: Not Causality
A is a good predictor of B. But is A regulating
B? The ground truth might be B regulating A, a
third gene C influencing both, or a more
complicated variant.
(Figure: alternative network structures over genes A, B,
and C that would all make A predictive of B.)
44
Approaches to Get Causality
  • Use knock-outs (Pe'er, Regev, Elidan and
    Friedman, 2001). But these are not available in most
    organisms.
  • Use time-series data and Dynamic Bayesian
    Networks (Ong, Glasner and Page, 2002). But there is
    even less data, typically.
  • Use other data sources, e.g., sequences upstream of
    genes, where transcription regulators may bind
    (Segal, Barash, Simon, Friedman and Koller, 2002;
    Noto and Craven, 2005)

45
A Dynamic Bayes Net
46
Problem: Not Enough Data Points to Construct a
Large Network
  • Fortunate to get 100s of chips
  • But have 1000s of genes
  • E. coli: 4,000
  • Yeast: 6,000
  • Human: 30,000
  • Want to learn a causal graphical model over 1000s
    of variables with 100s of examples (settings of
    the variables)

47
Advance: Module Networks (Segal, Pe'er, Regev,
Koller & Friedman, 2005)
  • Cluster genes by similarity over expression
    experiments
  • All genes in a cluster are tied together: same
    parents and CPDs
  • Learn structure subject to this tying together of
    genes
  • Iteratively re-form clusters and re-learn the
    network, in an EM-like fashion

48
Problem: Data are Continuous but Models are
Discrete
  • Gene chips provide a real-valued mRNA measurement
  • Boolean networks and most practical Bayes net
    learning algorithms assume discrete variables
  • May lose valuable information by discretizing
    (sketch below)
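For concreteness, a sketch of the discretization this slide warns about: each real-valued measurement (say, a log expression ratio) is mapped to up/same/down, discarding magnitude; the threshold here is an arbitrary assumption.

import numpy as np
def discretize(log_ratios, threshold=1.0):
    # +1 = up-regulated, 0 = unchanged, -1 = down-regulated
    x = np.asarray(log_ratios)
    return np.where(x > threshold, 1, np.where(x < -threshold, -1, 0))
# discretize([2.3, -0.1, -1.7]) -> array([ 1,  0, -1])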

49
Advance: Use of Dynamic Bayes Nets with
Continuous Variables (Segal, Pe'er, Regev, Koller
& Friedman, 2005)
  • Expression measurements used directly instead of
    discretized (up, down, same)
  • Assume linear influence of parents on children
    (Michaelis-Menten assumption)
  • Work so far: constructed the network from the
    literature and learned parameters

50
Problem: Much Missing Information
  • mRNA from gene 1 doesn't directly alter the level of
    mRNA from gene 2
  • Rather, the protein product of gene 1 may alter the
    level of mRNA from gene 2 (e.g., a transcription
    factor)
  • Activation of a transcription factor might not
    occur by making more of it, but just by
    phosphorylating it (post-translational
    modification)

51
Example Transcription Regulation
Operon
Operon
DNA
52
Approach: Measure More Stuff
  • Mass spectrometry (later) can measure protein
    rather than mRNA
  • Doesn't measure all proteins
  • Not very quantitative (presence/absence)
  • 2D gels can measure post-translational
    modifications, but are still low-throughput because
    of the current analysis
  • Co-immunoprecipitation (later) and Yeast 2-Hybrid
    can measure protein interactions, but are noisy

53
Another Way Around the Limitations
  • Identify a smaller part of the task that is a step
    toward a full regulatory pathway
  • Part of a pathway
  • Classes or groups of genes
  • Examples
  • Chromatin remodelers; predicting the operons
    in E. coli

54
Chromatin Remodelers and Nucleosomes (Segal et al.,
2006)
  • Previous DNA picture oversimplified
  • The DNA double-helix is wrapped in further complex
    structure
  • DNA is accessible only if part of this structure
    is unwound
  • Can we predict which chromatin remodelers act on
    which parts of DNA, and also what activates a
    remodeler?

55
The E. coli Genome
56
Finding Operons in E. coli (Craven, Page,
Shavlik, Bockhorst and Glasner, 2000)
g3
g2
g4
g5
g1
promoter
terminator
  • Given: known operons and other E. coli data
  • Do: predict all operons in E. coli
  • Additional sources of information
  • gene-expression data
  • functional annotation

57
Comparing Naive Bayes and Decision Trees (C5.0)
58
Using Only Individual Features
59
Single-Nucleotide Polymorphisms
  • SNPs: individual positions in DNA where
    variation is common
  • Roughly 2 million known SNPs in humans
  • The new Affymetrix whole-genome scan measures 500,000
    of these
  • Easier/faster/cheaper to measure SNPs than to
    completely sequence everyone
  • Motivation:

60
If We Sequenced Everyone
Susceptible to Disease D or Responds to Treatment
T
Not Susceptible or Not Responding
61
Example of SNP Data
62
Phasing (Haplotyping)
63
Advantages of SNP Data
  • Persons SNP pattern does not change with time or
    disease, so it can give more insight into
    susceptibility
  • Easier to collect samples (can simply use blood
    rather than affected tissue)

64
Challenges of SNP Data
  • Unphased: algorithms exist for phasing
    (haplotyping), but they make errors and
    typically need related individuals and dense
    coverage
  • Missing values are more common than in
    microarray data (though improving substantially,
    down to around 1-2% now)
  • Many more measurements. For example, the Affymetrix
    human SNP chip has half a million SNPs.

65
Supervised Learning Task
  • Given: a set of SNP profiles, each from a
    different patient
  • Phased: nucleotides at each SNP position on each
    copy of each chromosome constitute the features,
    and the patient's disease constitutes the class
  • Unphased: the unordered pair of nucleotides at each
    SNP position constitutes the features, and the
    patient's disease constitutes the class
  • Do: learn a model that accurately predicts
    class based on features (see the sketch below)
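A sketch of one common way to turn the unphased version of these features into numbers (my illustration; the column names and genotypes are made up): each SNP becomes the count of one arbitrarily chosen allele, so AA/AB/BB map to 0/1/2.

import pandas as pd
geno = pd.DataFrame({"SNP1": ["AA", "AB", "BB"],
                     "SNP2": ["AB", "BB", "AA"]})
X = geno.applymap(lambda g: g.count("B"))      # copies of the "B" allele: 0, 1, or 2
# X can now be fed to any standard learner, with disease status as the class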

66
Waddell et al., 2005
  • Multiple Myeloma, Young (susceptible) vs. Old
    (less susceptible), 3,000 SNPs, best at 64%
    accuracy (training)
  • SVM with feature selection (repeated on every
    fold of cross-validation): 72% accuracy; also
    naïve Bayes. Significantly better than chance.

67
Listgarten et al., 2005
  • SVMs from SNP data predict lung cancer
    susceptibility at 69% accuracy
  • Naïve Bayes gives similar performance
  • Best single SNP at less than 60% accuracy
    (training)

68
Lessons
  • Supervised data mining algorithms can predict
    disease susceptibility at rates better than
    chance and better than individual SNPs
  • Accuracies are much lower than we see with microarray
    data, because we're predicting who will get the
    disease, not who already has it

69
Future Directions
  • Pharmacogenetics: predicting drug response from a
    SNP profile
  • Drug efficacy
  • Adverse reaction
  • Combining SNP data with other data types, such as
    clinical (history, lab tests) and microarray

70
Proteomics
  • Microarrays are useful primarily because mRNA
    concentrations serve as a surrogate for protein
    concentrations
  • We would like to measure protein concentrations
    directly, but at present cannot do so in the same
    high-throughput manner
  • Proteins do not have obvious direct complements
  • Could build molecules that bind, but binding is
    greatly affected by protein structure

71
Time-of-Flight (TOF) Mass Spectrometry (thanks
Sean McIlwain)
Detector
  • Measures the time for an ionized particle,
    starting from the sample plate, to hit the
    detector

Laser
Sample
V
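An editorial aside on the physics behind these slides (standard time-of-flight relations, not taken from the tutorial): each ion of charge z picks up kinetic energy z*e*V from the accelerating potential, so lighter ions fly faster and flight time grows with the square root of m/z. A minimal sketch, with illustrative numbers:

import math
E_CHARGE = 1.602176634e-19       # elementary charge, C
DALTON = 1.66053906660e-27       # atomic mass unit, kg
def flight_time(mass_da, z, accel_voltage, drift_length_m):
    # z*e*V = 0.5*m*v**2  =>  t = L * sqrt(m / (2*z*e*V))
    m = mass_da * DALTON
    v = math.sqrt(2 * z * E_CHARGE * accel_voltage / m)
    return drift_length_m / v
# e.g. flight_time(1000, 1, 10_000, 1.0) for a 1 kDa ion, 10 kV, 1 m drift tube (assumed values)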
72
Time-of-Flight (TOF) Mass Spectrometry 2
Detector
  • Matrix-Assisted Laser Desorption-Ionization
    (MALDI)
  • Crystalloid structures made using a proton-rich
    matrix molecule
  • Hitting the crystalloid with a laser causes molecules
    to ionize and fly towards the detector

Laser
Sample
V
73
Time-of-Flight Demonstration 0
Sample Plate
74
Time-of-Flight Demonstration 1
Matrix Molecules
75
Time-of-Flight Demonstration 2
Protein Molecules
76
Time-of-Flight Demonstration 3
Laser
Detector
Positive Charge
10KV
77
Time-of-Flight Demonstration 4
Proton kicked off matrix molecule onto another
molecule
Laser pulsed directly onto sample

10KV
78
Time-of-Flight Demonstration 5
Lots of protons kicked off matrix ions, giving
rise to more positively charged molecules





10KV
79
Time-of-Flight Demonstration 6
The high positive potential under the sample plate
causes positively charged molecules to accelerate
towards the detector





10KV
80
Time-of-Flight Demonstration 7

Smaller-mass molecules hit the detector first, while
heavier ones are detected later





10KV
81
Time-of-Flight Demonstration 8




The flight time is measured from when the laser is
pulsed until the molecule hits the detector


10KV
82
Time-of-Flight Demonstration 9






The experiment is repeated a number of times, counting
the frequencies of flight times
10KV
83
Example Spectra from a Competition by Lin et al.
at Duke
These are different fractions from the same
sample.
(Plot axes: intensity vs. M/Z)
84
Trypsin-Treated Spectra
(Plot axes: frequency vs. M/Z)
85
Many Challenges Raised by Mass Spectrometry Data
  • Noise: extra peaks from handling of the sample, from
    the machine and environment (electrical noise), etc.
  • M/Z values may not align exactly across spectra
    (resolution 0.1)
  • Intensities are not calibrated across spectra;
    quantification is difficult
  • Cannot get all proteins; typically only several
    hundred. To improve the odds of getting the ones we
    want, we may fractionate our sample by 2D gel
    electrophoresis or liquid chromatography.

86
Challenges (Continued)
  • Better results if we partially digest proteins
    (break them into smaller peptides) first
  • Can be difficult to determine which proteins we
    have from a spectrum
  • Isotopic peaks: C13 and N15 atoms in varying
    numbers cause multiple peaks for a single peptide

87
Handling Noise: Peak Picking
  • Want to pick peaks that are statistically
    significant relative to the noise signal

We want to use these peaks as features in our learning
algorithms (a sketch follows below).
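A rough sketch of such peak picking (my simplification, assuming SciPy): keep local maxima that rise several noise-scales above the baseline; real pipelines also do baseline correction and peak alignment across spectra.

import numpy as np
from scipy.signal import find_peaks
def pick_peaks(intensity, noise_sds=3.0):
    noise = np.median(np.abs(intensity - np.median(intensity)))   # robust noise scale (MAD)
    peaks, _ = find_peaks(intensity, height=noise_sds * noise)
    return peaks            # indices into the M/Z axis, usable as features
# peak_idx = pick_peaks(spectrum)   # 'spectrum' = intensity array for one sample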
88
Many Supervised Learning Tasks
  • Learn to predict proteins from spectra, when the
    organism's proteome is known
  • Learn to identify isotopic distributions
  • Learn to predict disease from either proteins,
    peaks, or isotopic distributions as features
  • Construct pathway models

89
Using Mass Spectrometry for Early Detection of
Ovarian Cancer (Petricoin et al., 2002)
  • Ovarian cancer is difficult to detect early, often
    leading to poor prognosis
  • Trained and tested on mass spectra from blood
    serum
  • 100 training cases, 50 with cancer
  • Held-out test set of 116 cases, 50 with cancer
  • 100% sensitivity, 95% specificity (63/66) on the
    held-out test set

90
Not So Fast
  • Data mining methodology seems sound
  • But Keith Baggerly argues that cancer samples
    were handled differently than normal samples, and
    perhaps the data were preprocessed differently too
  • If we run cancer samples Monday and normals
    Wednesday, we could get differences from a machine
    breakdown or nearby electrical equipment that's
    running on Monday but not Wednesday
  • Lesson: tell collaborators they must randomize
    samples for the entire processing phase, and of
    course all our preprocessing must be the same
  • Debate is still raging; results not replicated in
    trials

91
Other Proteomics: 3D Structures
92
Other Proteomics: Interactions
Figure from Ideker et al., Science
292(5518):929-934, 2001
  • each node represents a gene product (protein)
  • blue edges show direct protein-protein
    interactions
  • yellow edges show interactions in which one
    protein binds to DNA and affects the expression
    of another

93
Protein-Protein Interactions
  • Yeast 2-Hybrid
  • Immunoprecipitation
  • Antibodies (immuno) are made by combinatorial
    combinations of certain proteins
  • Millions of antibodies can be made, to recognize
    a wide variety of different antigens
    (invaders), often by recognizing specific
    proteins

antibody
protein
94
Protein-Protein Interactions
95
Immunoprecipitation
antibody
96
Co-Immunoprecipitation
antibody
97
Many Supervised Learning Tasks
  • Learn to predict protein-protein interactions;
    protein 3D structures may be critical
  • Use protein-protein interactions in the construction
    of pathway models
  • Learn to predict protein function from
    interaction data

98
ChIP-chip Data
  • Immunoprecipitation can also be done to identify
    proteins interacting with DNA rather than with other
    proteins
  • Chromatin immunoprecipitation (ChIP): grab a sample
    of DNA bound to a particular protein
    (transcription factor)
  • ChIP-chip: run this sample of DNA on a microarray
    to see which DNA was bound
  • Example of analysis of such new data: Keles et
    al., 2006

99
Metabolomics
  • Measures the concentration of each low-molecular-weight
    molecule in a sample
  • These are typically metabolites, or small
    molecules produced or consumed by reactions in
    biochemical pathways
  • These reactions are typically catalyzed by proteins
    (specifically, enzymes)
  • These data are typically also mass spectrometry,
    though they could also be NMR

100
Lipomics
  • Analogous to metabolomics, but measuring
    concentrations of lipids rather than metabolites
  • Can potentially help induce biochemical pathway
    information or aid disease diagnosis or
    treatment choice

101
To Design a Drug
  • Identify the target protein: requires knowledge of the
    proteome/genome and the relevant biochemical pathways
  • Determine the target site structure: crystallography,
    NMR; difficult if membrane-bound
  • Synthesize a molecule that will bind: imperfect
    modeling of structure; structures may change at
    binding
  • And even then...
102
Molecule Binds Target, But May
  • Bind too tightly or not tightly enough
  • Be toxic
  • Have other effects (side-effects) in the body
  • Break down as soon as it gets into the body, or
    not leave the body soon enough
  • Not get to where it should in the body
    (e.g., fail to cross the blood-brain barrier)
  • Not diffuse from the gut to the bloodstream

103
And Every Body is Different
  • Even if a molecule works in the test tube and
    works in animal studies, it may not work in
    people (will fail in clinical trials).
  • A molecule may work for some people but not
    others.
  • A molecule may cause harmful side-effects in some
    people but not others.

104
Typical Practice when Target Structure is Unknown
  • High-Throughput Screening (HTS): test many
    molecules (1,000,000) to find some that bind to
    the target (ligands)
  • Infer (induce) the shape of the target site from 3D
    structural similarities
  • A shared 3D substructure is called a pharmacophore
  • Perfect example of a machine learning task with a
    spatial target

105
An Example of Structure Learning
Inactive
Active
106
Common Data Mining Approaches
  • Represent a molecule by thousands to millions of
    features and use standard techniques (e.g., KDD
    Cup 2001)
  • Represent each low-energy conformer by a feature
    vector and use multiple-instance learning (e.g.,
    Jain et al., 1998)
  • Relational learning
  • Inductive logic programming (e.g., Finn et al.,
    1998)
  • Graph mining

107
Supervised Learning Task
  • Given: a set of molecules, each labeled by
    activity (binding affinity for the target protein)
    and a set of low-energy conformers for each
    molecule
  • Do: learn a model that accurately predicts
    activity (may be Boolean or real-valued)

108
ILP as Illustration: The Logical Representation
of a Pharmacophore
109
Background Knowledge I
  • Information about atoms and bonds in the
    molecules
  • atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
  • atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
  • atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
  • bond(m1,a1,a2,1).
  • bond(m1,a2,a3,1).

110
Background knowledge II
  • Definition of distance equivalence
  • dist(Drug,Atom1,Atom2,Dist,Error) :-
  •     number(Error),
  •     coord(Drug,Atom1,X1,Y1,Z1),
  •     coord(Drug,Atom2,X2,Y2,Z2),
  •     euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
  •     Diff is Dist1 - Dist,
  •     absolute_value(Diff,E1),
  •     E1 < Error.
  • euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
  •     Dsq is (X1-X2)^2 + (Y1-Y2)^2 + (Z1-Z2)^2,
  •     D is sqrt(Dsq).

111
Central Idea: Generalize by searching a lattice
112
Conformational model
  • Conformational flexibility modelled as multiple
    conformations
  • Sybyl randomsearch
  • Catalyst

113
Pharmacophore description
  • Atom and site centred
  • Hydrogen bond donor
  • Hydrogen bond acceptor
  • Hydrophobe
  • Site points (limited at present)
  • User definable
  • Distance based

114
Example 1: Dopamine agonists
  • Agonists taken from the Martin data set on the QSAR
    Society web pages
  • Examples (5-50 conformations/molecule)

115
Pharmacophore identified
  • Molecule A has the desired activity if
  • in conformation B molecule A contains a
    hydrogen acceptor at C, and
  • in conformation B molecule A contains a basic
    nitrogen group at D, and
  • the distance between C and D is 7.05966 +/-
    0.75 Angstroms, and
  • in conformation B molecule A contains a
    hydrogen acceptor at E, and
  • the distance between C and E is 2.80871 +/-
    0.75 Angstroms, and
  • the distance between D and E is 6.36846 +/-
    0.75 Angstroms, and
  • in conformation B molecule A contains a
    hydrophobic group at F, and
  • the distance between C and F is 2.68136 +/-
    0.75 Angstroms, and
  • the distance between D and F is 4.80399 +/-
    0.75 Angstroms, and
  • the distance between E and F is 2.74602 +/-
    0.75 Angstroms.

116
Example II: ACE inhibitors
  • 28 angiotensin converting enzyme inhibitors taken
    from the literature
  • D. Mayer et al., J. Comput.-Aided Mol. Design, 1,
    3-16, (1987)

117
ACE pharmacophore
  • Molecule A is an ACE inhibitor if
  • molecule A contains a zinc-site B,
  • molecule A contains a hydrogen acceptor C,
  • the distance between B and C is 7.899 +/-
    0.750 A,
  • molecule A contains a hydrogen acceptor D,
  • the distance between B and D is 8.475 +/-
    0.750 A,
  • the distance between C and D is 2.133 +/-
    0.750 A,
  • molecule A contains a hydrogen acceptor E,
  • the distance between B and E is 4.891 +/-
    0.750 A,
  • the distance between C and E is 3.114 +/-
    0.750 A,
  • the distance between D and E is 3.753 +/-
    0.750 A.

118
Pharmacophore discovered
Zinc site H-bond acceptor
119
Additional Finding
  • Original pharmacophore rediscovered plus one
    other
  • different zinc ligand position
  • similar to alternative proposed by Ciba-Geigy

120
Example III: Thermolysin inhibitors
  • 10 inhibitors for which crystallographic data are
    available in the PDB
  • Conformationally challenging molecules
  • Experimentally observed superposition

121
Key binding site interactions
Asn112-NH
OC Asn112
S2
Arg203-NH
S1
OC Ala113
Zn
122
Interactions made by inhibitors
123
Pharmacophore Identification
  • Structures considered: 1HYT, 1THL, 1TLP, 1TMN, 2TMN,
    4TLN, 4TMN, 5TLN, 5TMN, 6TMN
  • Conformational analysis using Best conformer
    generation in Catalyst
  • 98-251 conformations/molecule

124
Thermolysin Results
  • 10 5-point pharmacophores identified, falling into
    2 groups (7/10 molecules)
  • 3 acceptors, 1 hydrophobe, 1 donor
  • 4 acceptors, 1 donor
  • Common core of Zn ligands, Arg203 and Asn112
    interactions identified
  • Correct assignments of functional groups
  • Correct geometry to 1 Angstrom tolerance

125
Thermolysin results
  • Increasing the tolerance to 1.5 Angstroms finds a common
    6-point pharmacophore including one extra
    interaction

126
Example IV: Antibacterial peptides (Spatola et
al., 2000)
  • Dataset of 11 pentapeptides showing activity
    against Pseudomonas aeruginosa
  • 6 actives (IC50 < 64 mg/ml)
  • 5 inactives

127
Pharmacophore Identified
A molecule M is active against Pseudomonas
aeruginosa if it has a conformation B such that:
M has a hydrophobic group C;
M has a hydrogen acceptor D;
the distance between C and D in conformation B is 11.7 Angstroms;
M has a positively-charged atom E;
the distance between C and E in conformation B is 4 Angstroms;
the distance between D and E in conformation B is 9.4 Angstroms;
M has a positively-charged atom F;
the distance between C and F in conformation B is 11.1 Angstroms;
the distance between D and F in conformation B is 12.6 Angstroms;
the distance between E and F in conformation B is 8.7 Angstroms.
Tolerance: 1.5 Angstroms
128
(No Transcript)
129
Clinical Databases of the Future (Dramatically
Simplified)
PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  SNP1  SNP2  ...  SNP500K
P1         AA    AB    ...  BB
P2         AB    BB    ...  AA

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months
130
Final Wrap-up
  • Molecular biology is collecting lots and lots of
    data in the post-genome era
  • Opportunity to connect molecular-level
    information to diseases and treatment
  • Need analysis tools to interpret it
  • Data mining opportunities abound
  • Hopefully this tutorial provided a solid start
    toward applying data mining to high-throughput
    biological data

131
Thanks To
  • Jude Shavlik
  • John Shaughnessy
  • Bart Barlogie
  • Mark Craven
  • Sean McIlwain
  • Jan Struyf
  • Arno Spatola
  • Paul Finn
  • Beth Burnside
  • Michael Molla
  • Michael Waddell
  • Irene Ong
  • Jesse Davis
  • Soumya Ray
  • Jo Hardin
  • John Crowley
  • Fenghuang Zhan
  • Eric Lantz

132
If Time Permits: some of my group's directions in
the area
  • Clinical data (with Jesse Davis and Beth Burnside,
    M.D.)
  • Addressing another problem with current
    approaches to biological network learning (with
    Soumya Ray and Eric Lantz)

133
Using Machine Learning with Clinical Histories:
an Example
  • We'll use the example of mammography to show some
    issues that arise
  • These issues arise here even with just one
    relational table
  • These issues are even more pronounced with data
    in multiple tables

134
Supervised Learning Task
  • Given: a database of mammogram abnormalities for
    different patients. Radiologist-entered values
    describing the abnormality constitute the features,
    and the abnormality's biopsy result (benign or
    malignant) constitutes the class
  • Do: learn a model that accurately predicts
    class based on features

135
Mammography Database (Davis et al., 2005; Burnside
et al., 2005)
136
Original Expert Structure
137
Level 1: Parameters
Given: features (node labels, or fields in the
database), data, and a Bayes net structure. Learn:
the probabilities. Note: the probabilities needed are
Pr(Be/Mal), Pr(Shape | Be/Mal), Pr(Size | Be/Mal)
(see the sketch below).
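A sketch of this Level 1 step (my illustration with a few made-up records): with the structure fixed, the needed probabilities are just conditional frequencies counted from the database (optionally smoothed).

import pandas as pd
rows = [("malignant", "irregular"), ("benign", "round"),
        ("benign", "oval"), ("malignant", "irregular")]
db = pd.DataFrame(rows, columns=["BeMal", "Shape"])
pr_bemal = db["BeMal"].value_counts(normalize=True)                               # Pr(Be/Mal)
pr_shape_given_bemal = db.groupby("BeMal")["Shape"].value_counts(normalize=True)  # Pr(Shape | Be/Mal)
print(pr_shape_given_bemal)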
138
Level 2: Structure
Be/Mal
Given: features and data. Learn: the Bayes
net structure and probabilities. Note: with this
structure, we now need Pr(Size | Shape, Be/Mal)
instead of Pr(Size | Be/Mal).
Shape
Size
139
Mammography Database
140
Mammography Database
141
Mammography Database
142
Level 3: Aggregates
Given: features, data, and background knowledge
(aggregation functions such as average, mode, max,
etc.). Learn: useful aggregate features, a
Bayes net structure that uses these features, and
probabilities. New features may use other
rows/tables (see the sketch after this slide).
Avg size this date
Be/Mal
Shape
Size
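A sketch of computing the "Avg size this date" aggregate above with pandas (my illustration; the column names are assumptions): the new feature for each abnormality is computed from other rows of the same table.

import pandas as pd
def add_avg_size_this_date(df):
    # df: one row per abnormality, with columns Patient, Date, Size, Shape, BeMal
    avg = df.groupby(["Patient", "Date"])["Size"].transform("mean")
    return df.assign(AvgSizeThisDate=avg)      # aggregate feature drawn from other rows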
143
Mammography Database
144
Mammography Database
145
Mammography Database
146
Level 4: View Learning
Given: features, data, and background knowledge
(aggregation functions and intensionally-defined
relations such as "increase" or "same
location"). Learn: useful new features defined by
views (equivalent to rules or SQL queries), the Bayes
net structure, and probabilities.
Shape change in abnormality at this location
Increase in average size of abnormalities
Avg size this date
Be/Mal
Shape
Size
147
Example of Learned Rule
is_malignant(A) IF
  'BIRADS_category'(A, b5),
  'MassPAO'(A, present),
  'MassesDensity'(A, high),
  'HO_BreastCA'(A, hxDCorLC),
  in_same_mammogram(A, B),
  'Calc_Pleomorphic'(B, notPresent),
  'Calc_Punctate'(B, notPresent).
148
ROC: Level 2 (TAN) vs. Level 1
149
Precision-Recall Curves
150
(No Transcript)
151
SAYU-View
  • Improved View Learning approach
  • SAYU = Score As You Use
  • For each candidate rule, add it to the Bayesian
    network and see if it improves the network's
    score
  • Only add a rule (a new field for the view) if it
    improves the Bayes net (see the sketch below)
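A sketch of the SAYU acceptance test (my reading of the idea, not the authors' code; a simple naive Bayes stands in for the Bayes net, and a proper implementation would score on a held-out tuning set rather than the fitting data).

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import roc_auc_score
def sayu_accepts(base_features, rule_feature, y):
    def score(feats):
        model = BernoulliNB().fit(feats, y)
        return roc_auc_score(y, model.predict_proba(feats)[:, 1])
    extended = np.column_stack([base_features, rule_feature])   # candidate rule as a new boolean field
    return score(extended) > score(base_features)               # keep the rule only if the score improves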

152
(No Transcript)
153
(No Transcript)
154
Clinical Databases of the Future (Dramatically
Simplified)
PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  SNP1  SNP2  ...  SNP500K
P1         AA    AB    ...  BB
P2         AB    BB    ...  AA

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months
155
Another Problem with Current Learning of
Regulatory Models
  • Current techniques all use a greedy heuristic
  • Bayes net learning algorithms use the sparse
    candidate approach: to be considered as a parent
    of gene 1, another gene 2 must be correlated with
    gene 1
  • CPDs are often represented as trees, using greedy tree
    learning algorithms
  • All can fall prey to functions such as
    exclusive-or; do these arise?

156
Skewing Example (Page & Ray, 2003; Ray & Page,
2004; Rosell et al., 2005; Ray & Page, 2005)
Drosophila survival based on gender and Sxl gene
activity
157
Hard Functions
  • Our definition: those functions for which no
    attribute has gain according to standard purity
    measures (GINI, entropy)
  • Note: "hard" does not refer to the size of the
    representation
  • Example: n-variable odd parity
  • Many others

158
Learning Hard Functions
  • Standard method of learning hard functions (e.g.,
    with decision trees): depth-k lookahead
  • O(mn^(2^(k+1) - 1)) for m examples in n variables
  • We devise a technique that allows learning
    algorithms to efficiently learn hard functions
159
Key Idea
  • Hard functions are not hard for all data
    distributions
  • We can skew the input distribution to simulate a
    different one
  • By randomly choosing preferred values for
    attributes
  • Accumulate evidence over several skews to select
    a split attribute (rough sketch below)
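A rough sketch of one skew (my reading of the idea above, not the authors' code): pick a random preferred value for each binary attribute, weight each example by how well its attribute values match, and compute weighted information gain; repeating over several skews and accumulating the gains reveals relevant attributes of hard functions.

import numpy as np
def weighted_entropy(y, w):
    p = np.bincount(y, weights=w, minlength=2) / w.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()
def skewed_gains(X, y, favor=0.75, rng=np.random.default_rng(0)):
    n, d = X.shape
    preferred = rng.integers(0, 2, size=d)                           # random preferred value per attribute
    w = np.prod(np.where(X == preferred, favor, 1 - favor), axis=1)  # per-example weights under the skew
    base = weighted_entropy(y, w)
    gains = np.zeros(d)
    for j in range(d):                                               # weighted information gain per attribute
        gains[j] = base
        for v in (0, 1):
            mask = X[:, j] == v
            if w[mask].sum() > 0:
                gains[j] -= w[mask].sum() / w.sum() * weighted_entropy(y[mask], w[mask])
    return gains                                                     # accumulate these over several random skews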

160
Example: Uniform Distribution
161
Example: Skewed Distribution (Sequential
Skewing, Ray & Page, ICML 2004)