Title: NetRaVE: A novel method for constructing gene networks from microarray data
1NetRaVE: A novel method for constructing gene networks from microarray data
Bill Wilson
CSIRO Mathematics and Information Sciences, North Ryde, Sydney
www.csiro.au
2Microarray data analysis
- The usual analysis!
- Differential gene expression
- List of Top100 genes
- Lacks any connections or structure
- Where to start?
- Stop looking at favourite genes!!
3Microarray data analysis
- Current analysis paradigm
  - Gene expression values are treated as a response to a condition, not a predictor.
- GeneRaVE analysis
  - use gene expression as a predictor of a response
  - analyse gene expression values to find a small set of genes predictive of a response.
  - extend this to find a local network of genes close to this response
4Gene RaVE in action
- Properties of the algorithm
  - A supervised approach that requires no preselection of variables.
  - Parameter estimation and variable selection occur simultaneously.
  - Significance of our results is assessed by randomly permuting the labels or sample classes and rebuilding our model. How likely are we to get the same results by chance?
  - V-fold cross validation of results is also used, to obtain an estimate of overall prediction error and to guard against overfitting of our model (build model on v-1 groups, predict the group left out).
  - Used when an independent test set is not available, which is nearly always.
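The two checks above can be sketched in a few lines. This is a minimal illustration only: the nearest-centroid classifier is a stand-in and is not the GeneRaVE model, and the function names, the toy data and the choices v=5 and n_perm=100 are all assumptions made for the example.

```python
import random


def fit_centroids(X, y):
    """Stand-in classifier (NOT GeneRaVE): one mean expression vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def cv_error(X, y, v=5, seed=0):
    """V-fold cross validation: build on v-1 groups, predict the group left out."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in held_out)
    return errors / len(y)


def permutation_p_value(X, y, n_perm=100, v=5, seed=0):
    """Permute the class labels, rebuild the model, and count how often the
    permuted data does at least as well as the real labels."""
    rng = random.Random(seed)
    observed = cv_error(X, y, v=v, seed=seed)
    as_good = 0
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)
        if cv_error(X, shuffled, v=v, seed=seed) <= observed:
            as_good += 1
    return observed, (as_good + 1) / (n_perm + 1)


# Toy data (hypothetical): 40 samples, one informative "gene" and one noise "gene".
toy_rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[toy_rng.gauss(4.0 * lab, 1.0), toy_rng.gauss(0.0, 1.0)] for lab in y]
observed_error, p_value = permutation_p_value(X, y)
```

A small p_value says the cross validated error on the real labels is rarely matched once the labels are scrambled, which is the question the permutation bullet asks.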
5Network building method
- Identify genes predictive of our response
- Regressions (linear or trees) are carried out using the expression of the classifier genes as our response; the predictors are the expression levels of the remaining genes
- Repeat this process using the genes selected in the regression as the response variables in the next round
- Cross-validate to prevent overfitting. Tree pruning!
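The round-by-round procedure above can be sketched as follows. This is an illustration under stated assumptions, not the NetRaVE implementation: a greedy forward selection on regression residuals stands in for the sparse regression, a fixed correlation threshold (min_abs_r) stands in for cross validation and pruning, and the gene names, function names and toy data are hypothetical.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def residual(y, x):
    """Residual of y after a simple least-squares regression on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def select_predictors(expr, target, candidates, max_predictors=2, min_abs_r=0.3):
    """Stand-in variable selection: repeatedly pick the candidate most
    correlated with what is still unexplained in the target."""
    resid = expr[target]
    pool = [g for g in candidates if g != target]
    chosen = []
    while pool and len(chosen) < max_predictors:
        best = max(pool, key=lambda g: abs(pearson(expr[g], resid)))
        if abs(pearson(expr[best], resid)) < min_abs_r:
            break  # nothing left that predicts the residual well enough
        chosen.append(best)
        pool.remove(best)
        resid = residual(resid, expr[best])
    return chosen


def build_network(expr, classifier_genes, rounds=2):
    """Each round regresses the current response genes on the remaining genes;
    the genes selected become the responses of the next round."""
    edges = []  # (predictor, response) pairs
    in_network = set(classifier_genes)
    responses = list(classifier_genes)
    for _ in range(rounds):
        next_responses = []
        for target in responses:
            remaining = [g for g in expr if g not in in_network]
            for gene in select_predictors(expr, target, remaining):
                edges.append((gene, target))
                in_network.add(gene)
                next_responses.append(gene)
        responses = next_responses
    return edges


# Toy data (hypothetical): geneC drives geneB, geneB drives geneA, geneD is noise.
rng = random.Random(0)
n = 300
gene_c = [rng.gauss(0, 1) for _ in range(n)]
gene_b = [c + rng.gauss(0, 0.6) for c in gene_c]
gene_a = [b + rng.gauss(0, 0.6) for b in gene_b]
gene_d = [rng.gauss(0, 1) for _ in range(n)]
expr = {"geneA": gene_a, "geneB": gene_b, "geneC": gene_c, "geneD": gene_d}
edges = build_network(expr, ["geneA"])
```

Starting from geneA as the only classifier gene, the first round should attach geneB and the second round should attach geneC to geneB, leaving the unrelated geneD out of the network.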
6A gene regulatory network??
7The Smoking Data
Two classes: smokers (34), never-smokers (23); 57 samples
Affymetrix U133A array: 22,283 genes (probesets)
8The Smoking Data - classifiers
Can classify the two classes with the expression of 3 genes. Cross validated error rate 0.000!
- CYP1B1: cytochrome P450, family 1, subfamily B, polypeptide 1 (xenobiotic metabolism)
- CEACAM5: carcinoembryonic antigen-related cell adhesion molecule 5
- ALDH3A1: aldehyde dehydrogenase 3 family, member A1 (protects against oxidative damage)
9The Smoking Data - Local Gene Network
10The Smoking Data - Local Gene Network
11The Smoking Data - Local Gene Network
- metabolism (Xenobiotics / Hormones)
  - CYP1B1, CYP1A1: expression activated by PAHs, cigarette smoke; catabolises estrogen
  - AKR1B10: catalyses the reduction of carbonyl group-containing xenobiotics
  - AKR1C1: involved in progesterone catabolism
- MUC5AC: respiratory mucin. Upregulated in response to inflammation, pathogenic factors.
- MUC5B: salivary mucin.
- immediate-early genes: regulators of cell proliferation, differentiation, and transformation
  - EGR1 (interacts physically with Jun)
  - FOS, FOSB (all regulated by the JNK pathway, which is activated by oxidative stress)
  - TOB1: suppresses cell growth, antiproliferative
  - ATF3: regulated by EGR1
12The Smoking Data - Local Gene Network
- Immune system genes
  - HLA-DQA1: plays a central role in the immune system by presenting peptides derived from extracellular proteins.
  - CXCL1: chemokine
  - C3: complement component
- CEACAM5: carcinoembryonic antigen-related cell adhesion molecule 5, prognostic for colorectal cancer.
13The Smoking Data - recent publications
Recent publications showing relationships
detected in our network analysis - a form of
validation
Penning TM. AKR1B10: a new diagnostic marker of non-small cell lung carcinoma in smokers. Clin Cancer Res. 2005 Mar 1;11(5):1687-90.
Bottone FG Jr, Moon Y, Alston-Mills B, Eling TE. Transcriptional regulation of activating transcription factor 3 involves the early growth response-1 gene. J Pharmacol Exp Ther. 2005 Nov;315(2):668-77.
Baginski TK, Dabbagh K, Satjawatcharaphong C, Swinney DC. Cigarette Smoke Synergistically Enhances Respiratory Mucin Induction by Pro-inflammatory Stimuli. Am J Respir Cell Mol Biol. 2006 Mar 16.
14St Judes Leukemia Data
5 classes of leukemia (BCR_ABL, E2A_PBX1, MLL, TEL_AML1, T_ALL) and others
104 samples
Affymetrix U133A/B arrays: 44,000 genes (probesets)
- Can classify the different subtypes with the expression of 6 genes.
- Cross validated error rate 0.048
- PBX1
- SHCD1A
- PCLO
- C20ORF103
- REDD2
- DNAPTP6
15St Judes Leukemia Data (500,000 probes)
5 probes identified. Cross validated error rate 0.039
16[Figure: gene network. Node labels: ZBTB34, HIP1R, IGF-IImRNABP, ELK3, SLC27A2, INTRON of SLIC1, ZFHX1B, CCT2, KCNN1, ABTB1, PLCE1, PBX1, SCHIP2, BCL2, FLJ20313, GCSH, PCLO, Serpina6, REDD2, ZNF258, Y, FLHSD2, HLA DRB4, ALU seq, SHCD1A, HLA DPB1, PKC?, IGKC, C20orf103, DNAPTP6, TcRbVar, HLA DRB3, FBXW7, DCTN4, ABCC1, HLA DRB1, PKC?, Galectin, MRP036, HLA DQB1, AP3M1, HTLF, MRP621]
17[Figure: the gene network of slide 16, repeated]
18Network edges
The expression of DCTN4 predicts the expression
we observe for DNAPTP6, and the expression of
DNAPTP6 in turn predicts the expression of the
leukemia variable.
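One way to read such edges is as a directed graph in which "A predicts B" is an edge from A to B, so a chain of predictions is a path ending at the response. The sketch below uses only the two edges named on this slide; the function name and the graph representation are assumptions for illustration.

```python
from collections import deque

# Edges as (predictor, predicted) pairs, taken from the slide text.
edges = [("DCTN4", "DNAPTP6"), ("DNAPTP6", "LEUKEMIA")]


def path_to_response(edges, start, response):
    """Breadth-first search for a chain of predictions from start to response."""
    successors = {}
    for predictor, predicted in edges:
        successors.setdefault(predictor, []).append(predicted)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == response:
            return path
        for nxt in successors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of predictions links start to response


chain = path_to_response(edges, "DCTN4", "LEUKEMIA")
```

The edges are directional: searching from LEUKEMIA back to DCTN4 finds no path, because the slide only states that DCTN4 predicts DNAPTP6 and DNAPTP6 predicts the leukemia variable.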
19[Figure: the gene network of slide 16, with the response node Y now labelled LEUKEMIA and three groups of genes annotated: Cell cycle, Protein degradation, Immune system]
20[Figure: the annotated gene network of slide 19, repeated]
The identity of a few interesting genes in the analysis
21Hypothesis to the lab!
[Figure: callouts around the LEUKEMIA response node]
- PKCη: Protein Kinase C, eta. Regulates transcription factors. Expression is highly correlated with tumour progression in renal cell carcinoma.
- C20orf103: Unknown protein. Highly conserved in Human, Mouse, Rat, Fish, Chicken, C. elegans. Contains LAMP domain, which implies association with lysosome membrane. Conserved segments in promoter regions of Mouse and Human genes that potentially bind haematopoietic-specific trans factors. Contains potential FBXW7/CDC4 degron.
- IGKC: Immunoglobulin kappa constant region (light chain). Essential for immunoglobulin formation.
- FBXW7: F-Box WD-40 protein 7 (CDC4). Key regulator of cell cycle. Mutated in certain carcinomas.
- LEUKEMIA: St Judes Leukemia dataset (Ross M et al., Blood 2003). 104 patients; 6 (ALL) leukemia classes: T-ALL, E2A-PBX1, BCR-ABL, TEL-AML1, MLL, Hyperdiploid>50. Affymetrix U133A/B chips.
22Adding extra information to data analyses
What kind of information is out there?
- Molecular Databases
  - Protein-protein interactions - DIP
  - Metabolic pathways
- Genomics Databases
  - Genomes! Human, Chicken, Fugu, Rat, Mouse, Plants, blah blah
  - Mappings! Transcripts, mRNA, Genes, SNPs, TFBS, Things!
23NetRave - analysis of yeast microarray data
Yeast microarray data: Gasch et al. Mol. Biol. Cell 11, 4241 (2000)
- 172 microarrays
- 10 different stress conditions
Protein-protein binary interactions from DIP
The NUP116 gene
NUP116 is part of the nuclear pore complex (the protein door that lets things in and out of the nucleus) and contains a motif that binds mRNA.
24NetRave - analysis of yeast microarray data
Linear networks built with the following amounts of protein-protein interaction information: 0.0, 0.2, 0.4, 0.6
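25[Figure: linear network, protein-protein interaction amount 0.2]
26[Figure: linear network, protein-protein interaction amount 0.4]
27[Figure: linear network, protein-protein interaction amount 0.6]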
28NetRaVE - analysis of microarray data
- Relationship networks from microarray data
- GeneRaVE connects together genes that are predictive of each other's expression values.
- We can successfully build sparse networks from gene expression data that make biological sense, i.e. we can make a story out of them. Bringing in extra data improves the network.
- Currently working on methods to validate the networks, and also partnering with research groups that are interested in validating the results in the laboratory and testing hypotheses from our networks.
29BHH - CMIS
Harri Kiiveri, Aloke Phatak, David Mitchell, Maree O'Sullivan, Rob Dunne, Glenn Stone