Title: Systems biology: identification of regulatory regions and disease causing genes and mechanisms
1Systems biology: identification of regulatory
regions and disease causing genes and mechanisms
- PhD defense
- Peter Van Loo
- Promotor: P. Marynen
- Co-promotors: B. De Moor and C. De Wolf-Peeters
- Human Genome Laboratory
- Department of Human Genetics
- May 23rd, 2008
2Introduction
Gene expression
Regulatory regions
Genes
7Introduction
Systems biology identification of regulatory
regions and disease causing genes and mechanisms
[Figure: structure of a gene (TSS, 5' UTR, exons, introns, 3' UTR) and its regulatory regions, with transcription factor binding sites and the TATA-box]
8Introduction
Systems biology identification of regulatory
regions and disease causing genes and mechanisms
- Disease gene identification
  - Linkage analysis
  - Cytogenetics
  - Molecular cytogenetics
    - e.g. array-CGH
- Candidate genes
  - Validation labour intensive
→ Computational gene prioritization by data integration
10Introduction
Gene expression
Regulatory regions
Genes
1
11Computational detection of regulatory regions
cis-regulatory modules
Wasserman and Sandelin, Nat Rev Genetics 2004
12A classification of existing CRM detection methods
Type I
Type II
Type III
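13Computational detection of cis-regulatory
modules: method overview
[Figure: method overview. Human and mouse genes (TSS, 5' UTR, exons, CDS, 3' UTR) are aligned with LAGAN/VISTA, giving a database of conserved non-coding sequences (CNS). MotifScanner and a TFBS database (Transfac, Jaspar) provide the binding sites. ModuleMiner takes a set of co-regulated genes and produces a transcriptional regulatory model (example label: p,v, 211 bp). ModuleScanner uses this model to find new target genes.]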
14The transcriptional regulatory model (TRM) -
ModuleScanner
- Collection of Position Weight Matrices
- Parameters
- For a transcriptional regulatory model T, we can assign a score to a sequence s, in terms of:
  - TFBS: transcription factor binding site
  - log S(t): logarithmic score of TFBS t
  - f(p,t): limitations, depending on parameters p (0 if invalid, 1 if all criteria are satisfied)
  - maximisation over all binding sites of all transcription factors in the model
- ModuleScanner ranks all genes in the genome
- Highest ranking genes are putative target genes
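The scoring idea on this slide can be sketched as follows. This is a minimal illustration, not the ModuleScanner implementation: the slide's exact formula is not reproduced here, so the sketch assumes that the score is the sum of the best log-odds PWM scores, that the only constraint f(p,t) is "all sites fit within one window of a maximum module length", and that the background base frequency is uniform. The names `log_score`, `best_site_score`, `trm_score` and the toy matrices are hypothetical.

```python
import math

BACKGROUND = 0.25  # assumption: uniform background base frequency

def log_score(pwm, site):
    """Log-odds score of one candidate site (string over A, C, G, T) under one PWM."""
    return sum(math.log(col[base] / BACKGROUND) for col, base in zip(pwm, site))

def best_site_score(pwm, seq):
    """Highest log score of the PWM over all positions in seq, or None if seq is too short."""
    width = len(pwm)
    if len(seq) < width:
        return None
    return max(log_score(pwm, seq[i:i + width]) for i in range(len(seq) - width + 1))

def trm_score(pwms, seq, max_module_length):
    """Maximise, over all windows of at most max_module_length bp, the sum of the
    best site score of every PWM inside that window (the assumed 0/1 constraint)."""
    best = -math.inf
    last_start = max(0, len(seq) - max_module_length)
    for start in range(last_start + 1):
        window = seq[start:start + max_module_length]
        scores = [best_site_score(pwm, window) for pwm in pwms]
        if any(s is None for s in scores):
            continue  # constraint violated: a motif does not fit in this window
        best = max(best, sum(scores))
    return best

# Hypothetical toy matrices: one strongly prefers "AC", the other "GT".
PWM_AC = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
          {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01}]
PWM_GT = [{"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
          {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97}]
```

Scoring every gene's conserved sequence with `trm_score` and sorting the genes by that score would give a genome-wide ranking in the spirit of the last two bullets.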
15ModuleMiner: parameterless CRM detection in sets
of co-regulated genes
- Search space: all possible TRMs
[Figure: example TRMs from the search space, labelled v, p or p,v, with lengths of 97, 167, 211, 191, 43 and 465 bp]
16ModuleMiner: parameterless CRM detection in sets
of co-regulated genes
- For each TRM: use ModuleScanner to score all genes
17ModuleMiner: parameterless CRM detection in sets
of co-regulated genes
- Use order statistics to assign a score to the ranks of the co-regulated genes
- All genes in the genome are ordered by transcriptional regulatory model score
- Ranks and rank ratios r_i of the set of co-regulated genes:

Rank         6        11       18       29       45
Rank ratio   6.8E-4   1.2E-3   2.0E-3   3.3E-3   5.1E-3

- Assign p-value
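The rank ratios on this slide are ranks divided by the number of ranked genes (6 / 6.8E-4 suggests a ranking of roughly 9,000 genes, although the slide does not state the total). A minimal sketch of that step, assuming higher TRM scores are better and that ties are broken by input order; the function name `rank_ratios` and the toy scores are hypothetical.

```python
def rank_ratios(scores, gene_set):
    """Return the sorted rank ratios (rank / number of genes) of gene_set.

    scores: dict mapping every gene in the genome to its TRM score
            (assumption: a higher score means a better match).
    gene_set: the co-regulated genes whose ranks are of interest.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    rank = {gene: position + 1 for position, gene in enumerate(ordered)}
    n_genes = len(ordered)
    return sorted(rank[gene] / n_genes for gene in gene_set)

# Toy genome of four genes; g1 and g3 form the co-regulated set.
toy_scores = {"g1": 9.0, "g2": 7.5, "g3": 4.0, "g4": 1.0}
toy_ratios = rank_ratios(toy_scores, ["g1", "g3"])
```

The sorted ratios are the input that an order-statistics p-value would be computed from.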
20ModuleMiner: parameterless CRM detection in sets
of co-regulated genes
- Use order statistics to assign a score to each TRM
- Select the best performing TRM
- Genetic algorithm based optimization
[Figure: selected TRM (labelled p, 211 bp) with score 8E-9]
21In silico validation: leave-one-out
cross-validation
- High quality set of 12 smooth muscle specific genes
- Leave one gene out
- ModuleMiner: train a TRM on the other 11 genes
- ModuleScanner: rank the full genome
- Look at the position of the left-out gene
Nelander et al. (2003), Genome Res 13:1838-1854
22In silico validation: leave-one-out
cross-validation
- Do for all possible thresholds
  - Sensitivity: fraction of left-out genes above threshold
  - Specificity: fraction of the genome below threshold
- Plot on ROC curve
- Area Under the Curve is a measure of performance
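The threshold sweep described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used for the thesis: ranks are 1-based positions of each left-out gene in its own genome-wide ranking, "above threshold" means rank ≤ threshold, specificity is approximated over the whole genome as on the slide, and the area is taken with the trapezoid rule. The names `roc_points` and `auc` are hypothetical.

```python
def roc_points(ranks, n_genes):
    """ROC points (1 - specificity, sensitivity) for every possible rank threshold.

    ranks: 1-based rank of each left-out gene in the ranking trained without it.
    n_genes: number of genes in each ranking.
    """
    points = []
    for threshold in range(n_genes + 1):
        # Fraction of left-out genes ranked at or above the threshold.
        sensitivity = sum(1 for r in ranks if r <= threshold) / len(ranks)
        # Fraction of the genome called positive at this threshold.
        one_minus_specificity = threshold / n_genes
        points.append((one_minus_specificity, sensitivity))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

With this construction, a left-out gene that always ranks first gives an AUC just below 1, and one that always ranks last gives an AUC just above 0; the gap shrinks as the number of ranked genes grows.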
23In silico validation: leave-one-out
cross-validation
[Figure: ROC curve, AUC 50]
24In silico validation: leave-one-out
cross-validation
[Figure: ROC curve, AUC 100]
25In silico validation: leave-one-out
cross-validation
[Figure: ROC curve, AUC 93]
26Sensitivity to noise
[Figure: AUC (0.5 to 1.0) against the ratio of the number of random genes to the number of smooth muscle genes (0 to 4). Two series: 10 smooth muscle genes plus random genes, and smooth muscle genes plus 10 random genes.]
27Comparison to other CRM detection algorithms
[Figure: performance of ModuleMiner, ModuleSearcher, CisModule, EMCMODULE and random TRMs]
28Application of ModuleMiner to adult tissues and
embryonic development
- 10 microarray clusters
  - Genes expressed in different adult tissues
  - 9/10 successful CRM detection
- 5 custom-built sets
  - Embryonic development processes
  - 5/5 successful CRM detection
[Figure: distance from TSS for all conserved regions, adult tissue CRM predictions and embryonic development CRM predictions]
29Conclusions
- ModuleMiner
  - detects similar cis-regulatory modules in co-regulated genes
  - outperforms existing CRM detection algorithms on benchmark data
  - detects CRMs in microarray clusters of different adult tissues
    - Mostly close to TSS
  - detects CRMs in custom-built embryonic development sets
    - Mostly further from TSS
30Introduction
Gene expression
Regulatory regions
Genes
2
31Gene prioritization: genomic data fusion
[Figure: data sources (gene expression, anatomical expression, protein domains, process/pathway, gene regulation, protein-protein interactions, literature, BLAST) combined into P(gene)]
32ENDEAVOUR: the approach
[Figure: candidate genes are compared with the known (training) genes in each of n data sources, giving n prioritizations that are combined into one overall prioritization]
33Prioritization based on one data source
- Vector-based
  - Literature (text-mining)
  - Microarray data
  - Cis-regulatory motifs
- Attribute-based
  - Gene ontology
  - Protein domains
  - Pathways
  - Anatomical expression
- Other
  - BLAST
  - Cis-regulatory modules
  - Protein-protein interactions
[Figure: vector-based sources, vectors in a space with axes T1, T2 and T3 compared by cos(θ)]
[Figure: attribute-based sources, observed GO ID frequencies in the training genes compared with expected GO ID frequencies in the full genome to give a P-value]
[Figure: BLAST, protein sequences of the training genes form a local BLAST database that is searched with the candidate genes using BLASTP]
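For the vector-based data sources, the cos(θ) comparison can be sketched as below. This is a minimal illustration: representing the training set by the average of its vectors is an assumption made here for brevity, and the names `cosine`, `centroid` and `prioritize_by_cosine` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def prioritize_by_cosine(candidates, training_vectors):
    """Rank candidate genes by similarity to the training profile, best first.

    candidates: dict mapping gene name to its vector (e.g. term weights T1, T2, T3).
    training_vectors: list of vectors of the known (training) genes.
    """
    profile = centroid(training_vectors)
    return sorted(candidates, key=lambda gene: cosine(candidates[gene], profile), reverse=True)
```

A candidate whose term or expression vector points in the same direction as the training profile scores close to 1 and ends up at the top of that data source's prioritization.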
34Genomic data fusion: order statistics
[Figure: the n prioritizations, one per data source, of the candidate genes against the known (training) genes are combined into one overall prioritization]
35Cross-validation
- Training genes in which one gene is left out
- Prioritization (one data source) of 99 random test genes and the one left-out gene
- Position (n-th) of the left-out gene in the prioritization
- Sensitivity: fraction of left-out genes above threshold
- Specificity: fraction of random genes below threshold
- Plot on ROC curve (sensitivity against 1 - specificity)
- Performance measure: Area Under the Curve (AUC)
36Cross-validation
- Same procedure, with the prioritization based on all data sources
37Cross-validation on disease and pathway genes
[Figure: AUC (0 to 100) per data source for Diseases (OMIM, 627 genes, 29 diseases), Pathways (GO, 76 genes, 3 pathways) and Random. Data sources: GO, EST, BIND, KEGG, BLAST, InterPro, Literature, Transcription motifs, Microarray, Cis-regulatory modules, and the overall prioritization.]
38Integrated case study: predicting expression from
sequence
AGCTTCTCCTCTGTAGACACCGAGACTCATAACTCTGATGAGATCCACAG
TTCTATTGGAGTTGTGCAATGAAATAGCAGACACTCTTGGAATCTCTTGG
GGCTCCCCCAACTTCATGAATGAATCTCTAAGTTCTGCATGCCCCATATA
AACTGATGACAAGATCTTTGAGAGCACTGTTTCCTTAGTGGGTTTCCACA
GAGAAATTTTGAATATGGGGGTCCACGAAGTGGCTTGAGCCATCTACCCC
AACAACAACATTTGGCCTTTGGTGCCTCTCTAGTATTCTCCTGATGGTTA
TGCAGATGGTGGCATACAGAAATGGAGTAAATTAGTAAACTAAAAGAATA
AATGAGGTGCCCCATTTCTCTGACTCTATTCTAGGAAAATGAGTGAGAAG
CAGGATCTCCCAGATTTCAGGAGAGATCTGGGTCACTTTTTGGAGGTTTC
TGGTATTGAAAATTATATATATATATCCTCCAGCTGTATATATATATATA
TATATATATATATATATATATATATATATAACATCTCTATATGATATACG
TATCTATCTATACCTCTATAGATATCTATAGATATCTATCTATATCTCTA
TATGATATATAGAGATATAGATATCTCCTCCAGGTAATAGACTTAATTTT
TAAGAACATGTTTCAATTCACAGAAAAATTGAGCAGATGGTACAGAGAAT
AACCCTGTGCCCAGTTTCCCCTATGATTAACATAATACATTATATGGTAC
ACGTGTAACAGTGAAACAATATCGGTACATTATTATTCACTAAAGTTCAT
CATTGATTCAGATTTGTCTAGGTTGATCTTATGTCTTTTTGTGGCCAATT
ATTCCATCTAAGATTCGACATTATATTAAGTTGTCATGTCTCCTTAGGCT
AATCCTTGCCTGTGACAATTTCTCAGACTTTCCTTGTTTCTGATGACCTT
GATGGGCTTGAGGATTACTGGTTTTTTGTAGGACGCCCCTCTACTAGAAT
TTGTCTGATGTTTTTCTTATGATTAGACTAGTATTATGAGAGCAGGACCA
CAGAGAGAAAGAACAATTTTCACCACATCCTATCAAGAGTATATACTATC
AAGATGATTTATCATTGTTGATGTTGGTCTTAATCCCCTGGCTAAGAGAG
TGTTTGTCAGGCTTCTCCTAAGCTATTTTCCCCTGACTACCTTTCCATAC
GGAATATACTCCCCGGGAAGAAGTTACTATCTATAGCCCACAATTAAAGA
GTGTGGGTTTCTGTTTCTCCTCCTTAAGGCCGGCACATGTCTATAAATTA
TTTGGAATCCCTGTGCACATCTATATAAATAAATTTGGAATTACGGGATG
TTTGTCTTTTCTCTCTGGTTTATTAATTTACTTAATAATTTATTTATAAT
AGTATGGACTCATTACTTTTTTTTTTTTTTTTTTTTTGAGAAGGAGCCTC
ACTCTGTTGCCCAGGCTGGAGTGCAGTGGCACAATCTTGGCTCACTGAAC
CTCCGCCTCCCGGGTTCAATGGATTCTCCTGTCTCAGCCTCCCGAGTAGC
TGGGATTACAGGCATACGACACCATGCCCAGCTAATTTCTGTATTTTTAG
TAGAGACAGGGTTTCACCATGTTGGCCAGGCTGGTCTTGAACTCCTGACC
TCAAGTGATCTGCCCACCTCGGCCTCCCAAAGTGCTGGGATTACAGGTGT
AAGCCACTGCACCCAGCCCTGGACTCATTAATATTTATTTTATACTTTGG
GTTATAATGTAAAACACTATTCTATTTTGTTGCTCAAATTGTTGCAGCTT
TGGCCACTGGGAGCTCTTCAAGTGGCTCCTGTGTCTTTTTGAAATATCCC
TCACCAATGTAGTTTTGTTTTTGAATAATTCCTTACTTTAAGGTGCTACA
AGATCTTTCATTCTCATTTGTGTATTTCCTGCCCCAGTTTTAGAACTCAA
CAATTTCTCCAAGAAGCCTTGGTTCCAGCTGCTGAGAAATGGCATTAAAA
CTGAGACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTACTGAAAATAC
AAAAACTAGCCGGGCGTGGTGGTGCAGGGCTGTAATCCCAGCTACTCGGG
AGGCTGAGGCAAGAGAATCGCTTGAACCCGGGAGGCGGAGGTTACAGTCG
GCTGAGATCGCGCCACTGCACTCCAGCCTGGGCAACAGAGTGAGACTCTG
TGTCAAAAACAAAAACAAACACAAAAACAAAACAAAACTGAGCTCTGAGC
ACCAGGTGTGCTTGTTGCTACGACAAATATATTTCAAACCTTATATTTTT
AACACCAGCACCCACACAACTACAATACAATTGCACTATTCATAAAACAA
TTATAGATTATTAACAACATTCAATCATGGTGTCATAGGAGCCTGTGGTC
CTACACTGGATCCCACACACAAAACTTGCATATGATGGTCATCTTCTTTC
AGTCCTGTTAGGATTGAAAGAGAGATGTATAGCCTCAGTGGAGATAATAT
CAAAAGTCTAATTTTATTTATTTATTTCTTTCTTTATTTTGAGACAGGGT
CTTACTCTGTGGCCCAGGCTGGAGTGGTGCCATCATAGCTCACTGCAGCC
TCAAATTCCTGGGTTCAAGAAATCCTCCTGCCTCAGCCTCCCAAGTGGCT
AGCACTACAAGTATGTGCCATCATGCCTGGCTATTTTTTTTTTCTTCCGT
TTTTTTAGAGACAGGGTCTACGTTGCCCAAGCTGGTCTTGAATCCCTGGT
CTCAAGTGATCTTCCCACCTCAGCCTCACAAAGTATTGGGACTACAGTTG
TGAGTCATTGTGTCTGGCCCAAAAGTCCAAATTTGAGGCCTTCTTTGGAT
GTGTGGCCACAATAAATGGCTCTTGCAAGGCTGCCAACCCCTTACACTCT
TTCCATAATATGCCATAAGAAAAGCATACTGGATTTAGAAATAGGGCATG
AAAGTTCTGAATCCAGCTGTGTTAGTGTTATAGCATATATAGCAAGTGGA
TTGTGTCTGGGCCTCAATTTCCAATGATACAAAATCAGGAACATCAGATT
GGATAATGGCTAAAGGCCCTCCCAGTTCTAGCACACTATAATTTTCAACA
GACTTACACTGGGGGAATACAATTGGCTCCACTAGTCTTTGTATACAGGC
CTAATATTCCAGAAAGTCTAAACCAGTGGAGGCATGGGGGTGCGGAGGTC
GCGGCTAATAAATCAGAGTCATTTTATTATTTTTTGGGAATGCCAAGACC
TGTTAAAGGCTTTAGATAGTCTAGACAATCGGGCCTGAGAAACTTTAGAC
CTTTCTTTTTAAAGAATGAAGATCAAAAAAGTATAAAAAATATTGATGGA
AAGTATCTCTTTCATTGGTTTCATGTTCTGATAGATCAAGACTTCTTCCT
CTTTTTTTTTTTTTTCCTTTAGTAAGGGAAAACTCCTCATCTGCTTTTTC
CTCTGACTTCAAATAATTACCTTTAATGCAGTGATGGCTGAGCCACCTCT
AAGTTTCTTACCAGGAATCTCTCTCTAGGTTTTTATTTTTTTCTTTTTCT
CTTCCTTCCTTCTCTCCTCTCTGCCTCCCTCTCTTATGCTCTCCCTCCCT
CCCTTCCTTCCCTTGAATGTTATGATGTGTTTTTTACATCCATATACTAC
CCAGTGTACAAATGTCATCTCCTTCCTATCATTGATGGGGACAATTTGCA
AAAACAAATAGAAGGAAAATAAAAAAGGAAATATAGGGCAGAAAAGACAC
TTGGGAACTGTCACATTTGATTATGAATGCTGGAGATCAAAGGTGCAAGG
TCTTAGAACCTACTTCCTCCACCTCTTAACGTTTAAAATCTTCAATTGGC
TTTTGAACCCACTCAGCAAAATCCCAGACTTTGGTCTACAATTGGTTAAA
AATTGATAGAGTGAGGATTCTGGGACTGCCTTCTTTACTTAGAAGTTTAC
ATTTTAACTCCTTCCCTAGCCCCAGGTACACATATACACACAGCTCCTTT
CCACTCCTCTCGCACAGTTCTGTAAATATGTTTTGAAATGTAAAGGTACA
GAACTAAGCGCAGACCGGCATCCCTCAAATCATCGGGGCTATTCCTTCAC
ACAGCTGAGGAAACGGAGTCCTCACAAGTGGCTTTGCTCAATGTCCCATA
AAGAGTTTCAGGCACAGCTGTAATTAGAAACCAAGGGTTTGTGTGTGTGT
GCGCGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTTTGCAACAAATACAGT
GTTTTTTTTTTTCTCCCTACACTGTGCCCCCGTGGAGTCACATTTGTGTG
TCTGTGTCTCTGTACGTACATAAGTTACACAGACACTGACATGTAGGAAA
CGTGCACCAAAGTGTCTGTCTTCTGACCTCAGGTAACAGTATAATGACTT
GAATTTCAGGCAGCTGAAAGGTTTCTGCCGGTGGAGGTTGAAATAAACAA
GAAAAGCCACTGTGGAGATGTGAATGGAAAAGTACCGAGCCCTCCCTCCC
TCCGCACATTCTTCCGCTCTCCAGCTCTCCCTGCCATCGAGCTGGCTTCA
GATAGGCTTCTGCATGGTCAGGTGTACAAGAGGGCGGTGGGGAGAAGAAA
AAAAAAATGCAGGCACAACACGCAAATCAAGTTTTTCCACTTCTAGCCTT
AGGTAGTAGAGACAGCTAAGTACAGCAGCCAGCAGCCCGGCAACGGCAGC
GGGTGGACCAGCCACCCTGAGTTTACAAACACTCAAGTGCTTTCCTTCCC
TCATCCCTCTCAGAGTCCAGCTGCTGCTTTCCTTCATGCTAAGGTTTCAT
AGGAAGTGAAAACTCTGCTATTCAAAACAGCGATCGAACGCAATAAACAA
ATCATTACACACCCCCTAACCCCCATCACTTCTCTATTTTAAGCTTCTGA
TATTTATTCCCATTTTAAATAAGTGAGAAAAGTGTGGAAAATTAGTGTTT
GGGGGTAAACTCTGAGCCAGGCTGAAAAGGTTTCTAAAGGAAAAAAAAAT
CTCAGAACAATAAAGGCTAAAAGCAGGCAGCATATGGATGAAAATTAAAC
ACTGATACTTCCTTTTCAGAAGGCAGTAGCTGGAAATTATACACTTTTTT
AATGTCTCAAAACTTTCTGCTCATCTTGCTATGTTAAAAACGCCTTTCTT
TCTCCAAGGATACTACAAAAAGCTTGTTTACAACAGTTCTAAATGAAGGA
TTTGAAATAAAACGAACAGGTAAAATTTAACAAGTCTGATAGATAGTGTC
TCCCAAATCTATCAAAAGCAGTGCCAAGTACTTCAATGTAGCTGAGAGGC
ATAAATAAACCCAAATGACCATCAAAACTCATCATGACTTGGAGTTCGCT
CTGAGTTTTGCAGTTTACAAAGAGACCATGGCAGCCTTGCTTCCCTCAGT
TCTACAAGGACACAAGATATACACCTACAGACTCAAGTTGTCAGATTACA
CTGATCCTCTAAAATGACAGAGGGCCAGCAAATCATGCAGACCCATTTTC
AGTTGTGTTCCTGGGGTCACACATGCTCCTAGTGAAGACCCAGCCTATAA
TCCTGAAAGAAGAAAGCCTAGAGAAGGTGATTGATTTGAAAAAGTCTTCC
CAGTTTTAAAATCTTTAGTCCTATGATGTGGTATCTTAAAGACCTACCAA
GGTGCCAGAGGTTCCTGACAGGTGAAACCAACTTCCTCTTGTGAGCCCCC
TTAGAAAGAGGACAGAACCGTGTTTATTCCAAGGATAGGTTCTTTTTCAG
CTATGACTGAATTGTGGGGAAGGTTTTGCAAGGGGGAATTGGATGTGAAG
TCTGTTCTTTTCCTCAAATAGATGTAATATTAGGACCAGGCTATTTTATT
TTGTAATAAAGCTTATATTTACCCAGCAGCAATGATCAGGGACCTATTCT
TATGCCCAGTCCATGAGGCAAAGAGGTTGGCCTGGTCCTCTACTGAGTAT
TAGCAGCCAGTAACCATTAAATCATGGGACTAGTTGAAATGTAGTGCCCC
GAAGTCTGCAGGAATTATTCATACGACCCCAGACATGGAATCACTCTTTA
GAGCTTCTTAAGGATGATTTAAAAGAATCAGAATACGTTCAAGTCAGCCC
TTTCTTTAATCCTGTAACACGGCACTGCGGGAGTGAGGGAGGCCCACATA
GTGATGCCAACTGGATACTGAGGAGAGGTCAAGAATGAAAGAAGAAATGA
CATTCTGGAAGAAATTCAACTGGTATAATATTTGACAAAGTTACTTTCCT
AGGAATTGAAAAGAGATTGAGAGGCGGGTGCACAATTTTCCTCACCATTC
ATTCAGTTCAAAGTAAAAGAGACTCACCGAAAAGTAAGTGCCTATCTTTA
GAAAATTTTCAATAATGATTTTCTCTTTCTTTCTAACTGGTCTTCTGTTC
TGGTCAATTTCTTTCAGTGTAAACACATTGATTTGGCAAAAAGCAGTAGG
AAAATGTGGCACTCTGGCACTTGGTCCCAGAAATAATATGCTGGGAAGAT
TTGAGGTCCCTGGTGATGAGGTTAATTATATATGAACCAGCCCTGGTGGG
TTCTCCCTCTAGGGGCTCACTGCAGAGAATTAAAGAGGGCTGAGGTATGA
GAAGGGTGAATTCCTTCCCAGCCCCCACTCTGGCTGGTTCTACTACTGCC
TTAGAGAGCAGATTTCCCTTTGCTCTGCAGCGCCCCCATGGGGCTAAGAG
TGGAGTGGCAAAGGGAACACAGGAGGGACAAGCTGTGTTTCAGGTTGAGG
GGGGCGGTGGATGAGGCTGAATGGCAGTTTTGACAAAGAAAAAAGTGACC
AAAAATCATAAAAATAATCTTTTGAGGGCCCAATAGTAAGGCAGAGCCAT
ACAATTCACATTCCAAACCATATAGATACGTCTGAGAAATCCTAAAGTGC
TAATTGCTCATAAAAGAAAAAATTACACATATAAACACACAAAGAAAATC
CCTTCCACAAAATCGGGGTGTCATTTTGCATCCAGCGGGATTCATTTTAA
TTTCTTTGAAAATGAGAAGGAAGGGGACTCAAATGAAAAAGCAGATAGTC
TGCCTTCTGGCAGAATAAATCTGAACTTGACAATATCATGTGTCTTTGGG
GGTAAAACGTACATTTCAACAACAGTGACAGGATTAGGCCTATGTATATT
TTTCAAAAACCGTTCACAAGACAGGCTTTTCTGCAGAGGCTGCAGTAATC
CATCTGTCAATAAGTATTAAAATATTCAGATTTCACAGGGACAGACACTT
TAACGCATATTTCCTAAGCTCCAGCCCTTGTGGAAAATAATCAACCTCTT
TGCACCTTTCTGGGTTTTAAAACCTAAAATACAGCCTTTAAAAATGTGTG
TGTGTTGTGGGGTAGGGGGGTGCATTGCCAACAACATTTTCGGTGATAGA
TGGAACTTCTTACGGGACTGTCAATGAAAGAGATTTTCCAAATATCCCAG
CAAACAGCAATCTTTCACAGCTCTGATCACTCCTCCATTATAAACCCAAA
TTTTGGGTTGAGATAGGTAGATTATTTTAGACATATCTTTATTAGAAATT
AACAAGTGACGAGATTTTGTGGAAGCTTTAAGAATTCATCTGTAATTTAA
TAAGTCGCTTGAAGGACTCTCATAGCCAAGGCTCAGAACAGCCTGACCTT
TGAAAGCTGCTTCTGGTCCAAACATTTTGGGCTAATTCTTGAGGAATCTG
AAATATTATTTTCCCCTCACACCCTTCTTTTAAGAGAGAGACATAAAAGA
AACAAGAGTCTCCCTTATTCAGGGATGAGTAGGAGGGGAAAAAACCCGAA
CCAACATTTAAATAAGGAAACTAGCAGCTCTGAACAAACAAACTAGGACC
CACAATGAAATGATTCTGCACTGCAATTGCCTTTAAAAAGAAAGTAATAG
AGAAAAAGAGAAGGAAAGAATTTCTCCTTCTTCTCTACCCCCCCCCCACC
CCACCCCCCAACTCAGCTTCAAAGCTAAGAAGACTGTGCTGCGTGTAGTG
CATTGTAGTTGTGGCAGTCTGTTCTAAATACAGGCAGTATCTGTGATACT
GGCACGGCAGGCCTTTAGAATTCCCTCCGGCTGATCTCTTAAACACAGAC
TGAAGAGATTTTTTTACAACGACCTTGAAACGAGCCTCGAAAACAAAAAT
CTCAAGACCTTAAGAGAAAACAAAACACAAACAGGTATTTGGCTCACAGA
ATTTTGTAGAAAACACACACATACCACCCCGCCACCCCCACCCTCCCCCC
CACACACACGTTTCTTGCAACAAGAAATTTCCCAAGAGTCAACAATAACA
GATTAAACCCACCACTTGCTGTCCTGGAAAGAAACAAACCAAACCAAAAC
AAATCCTTTGAACATTTCTCTGAAGTGCAGGAGAGACACACTTCAGCAAA
AGTCCAAGGGGGAAAAAGAAAATTGCACCAAAGGAAAAAAAAAAAAAAAA
AGTGGGGGCTGGGATTGTTACATATGGCCAAAAATTTAAGCTTCTTTCAA
TAGTATTAGTATTGAAATAATACATCTTTAAAACGCTTGAGGGATTAGAT
AGGGAAAGAAAAGGCACGTACAAAAAAATCCAACCGATGCCGATCCTGTG
ATTTACGTAACACCACAAACTTGCAAAAGGCAAAAAATCAGAAGCAAAAA
TCCATAAACCATCAAAATACAGAAACCAAAAATCCCAAGCCACCACACCA
GAAAGAAAAAAACCCAGAACAACAGCAAAAACCCCTGTCCTAAATAAAAA
TAAAGCAAATGAACCCACCGAAAACTGCTTGGCAAATATTTTTCTCGTGG
TGCCTAATATTCTAGTTGGAAAGAGCTGTGATGTTTATTTTATTTTATTT
TTCTCTTACTCGCCTCTCTAACCCTACTATATATATAACATACTTTTCCC
AGTGGTTCAAACCTCTCGCTCCCTTTTGTGCATTTAGCTCGATCTGCTGA
GTTTATGGGTAAGAAAGAAGGAATTAGCCCCAGACCCCGGGAAAGCAAAG
CGCACTCCCCCTCTTATGTCACCGAATAGCAAATTAGTTCTCAGAATTCC
AGAGGCCGAGCTTTGCTACAGCGAAGGCGCCGACGTCACAGAGGAGGAGC
CCACGTGATGGTGGCGGAGCAGGCCATACCATCGTCTTGGGCCCGGGGAG
GGAGAGCCACCTTCA
How is this gene expressed?
39Predicting expression from sequence: macrophage
differentiation
[Figure: HL-60 cell line (granulocytic monocytic progenitor cell) treated with TPA gives differentiated HL-60 cells (macrophage)]
- Using only sequence information, can we predict genes up- or down-regulated during macrophage differentiation?
40CRM detection: prediction of new target genes
- 18 upregulated genes → transcriptional regulatory model → 100 new target genes
41CRM detection: prediction of new target genes
[Figure: fold upregulation (logarithmic axis, 0.01 to 100) of the 100 new target genes, with the top 20 marked]
42Prioritization of new target genes
- Training genes: 18 upregulated genes
- Candidate genes: 100 new target genes from the transcriptional regulatory model
- Prioritization
43Prioritization of new target genes - conclusions
[Figure: fold upregulation of the top 20 genes (two panels)]
44Conclusions
- Endeavour prioritizes candidate genes
  - Looks for similarities with known disease/pathway genes
  - Integrates information from many heterogeneous data sources
- Computational validation
  - Disease/pathway genes ranked on average at the 10th position of 100 candidate genes
- In vitro validation
  - Predicting gene expression from sequence
45Introduction
Gene expression
Regulatory regions
3
Genes
46Identification of disease causing mechanisms:
THRLBCL
- T cell/histiocyte rich large B cell lymphoma
- Similarities with nodular lymphocyte predominant Hodgkin's lymphoma (NLPHL)
- Functional meaning of the THRLBCL microenvironment?
- Microarray expression profiling of THRLBCL, in comparison with NLPHL
[Figure: survival (0 to 15 years) for THRLBCL and NLPHL]
47The microarray experiment - PCA plot
[Figure: PCA plot of THRLBCL, NLPHL and reactive lymph node samples]
48A three-gene quantitative RT-PCR classifier of
THRLBCL vs NLPHL
- 3 most significant genes
- One calibrator of each lymphoma type
- Each converted to give 6 percentage scores
- Averaged to give one NLPHL and one THRLBCL similarity score

Classification by the three-gene classifier (columns) against diagnosis by morphology (rows):

Diagnosis by morphology   THRLBCL   NLPHL
NLPHL                     0         46
THRLBCL                   23        0
49Differential expression
- THRLBCL signature: FCER1G, VSIG4, IDO, CCL8, TLR1, TLR2, TLR4, TLR8, CD14, STAT1, CCR1, CXCL10, CXCL16, CCRL2, CD80, CD86, CD274, CSF1R, CSF3R, PDCD1LG2, FCGR3B, FCGR1A, ICAM1, IL1RN, IL18BP, IRAK3, CD74, S100A9, CASP5, MSR1, CD163, SOD2, IFNAR1, IFNGR2, IFIT3, IFI6, C1QA, C1QC, C2, C3AR1
- NLPHL signature: FCRL1, CD79A, CD79B, CD19, CD22, MS4A1, PAX5, BCL11A, FGFR1OP, FCER2, BANK1
50The model
[Figure: model of the THRLBCL microenvironment. Labels: CCL8, recruitment, macrophages and dendritic cells, activation, IFN-γ, Toll-like receptors, scavenger receptors, innate immunity, IDO, VSIG4, tumor tolerance]
51Conclusions
- Expression profiles are in line with differences in microenvironment between THRLBCL and NLPHL
- Insight into the functional significance of the microenvironment
  - Tolerogenic immune response
- New targets for therapy
52Breast cancer: clinicopathological significance
of polysomy 17
[Figure: polysomy 17 in relation to tumour grade, Nottingham Prognostic Index, HER2 amplification, HER2 expression, ER status and PR status. Category labels: Normal; I, II, III; Negative, Positive; < 1, 1 - 3, 3 - 5, 5 - 10, > 10]
53Final conclusions
- Development of novel systems biology methods
  - ModuleMiner: CRM detection
  - Endeavour: gene prioritization
- Systems biology to gain more insight into diseases and processes
  - Predicting expression from sequence: integrated case study
  - A tolerogenic immune response in THRLBCL
  - Clinicopathological significance of polysomy 17 in breast cancer
54Perspectives
- Systems biology methods for the identification of
  - Regulatory regions
    - Protein-binding microarrays: more and better PWMs
  - Disease genes
    - Three systems biology methods
    - Combination with array-CGH
  - Disease mechanisms
    - Microarrays: focused experiments
    - Data integration
  - Disease treatment
    - Insight → directed treatment
56Three conservation options
- 1. All predicted binding sites in all human-mouse conserved non-coding sequences (CNSs), 10 kb 5' of the transcription start
- 2. Same as (1), but limit to binding sites that occur in both the human and mouse CNS
- 3. Same as (2), but add 100 kb of mouse sequence both 5' and 3' (to correct for transcription start annotation errors)
[Figure: human and mouse genes aligned with LAGAN/VISTA for the three options, showing the 10 kb region with CNS 1, 5' UTR, exon 1 and CDS, the human CNS against the mouse CNS, and the mouse sequence extended by 100 kb on both sides]
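Option 2 amounts to an intersection of predicted sites per conserved region. A minimal sketch, assuming predictions are stored per CNS as sets of transcription factor names and that a site counts as shared when the same factor is predicted in both species; the data layout and the function name `sites_conserved_in_both` are hypothetical.

```python
def sites_conserved_in_both(human_sites, mouse_sites):
    """Keep, per CNS, only the factors with a predicted site in both species.

    human_sites, mouse_sites: dicts mapping a CNS identifier to the set of
    transcription factor names with a predicted binding site in that CNS.
    CNSs without any shared factor are dropped from the result.
    """
    conserved = {}
    for cns_id, human_factors in human_sites.items():
        shared = human_factors & mouse_sites.get(cns_id, set())
        if shared:
            conserved[cns_id] = shared
    return conserved

# Toy example with made-up CNS identifiers and factor names.
toy_human = {"CNS1": {"SRF", "SP1"}, "CNS2": {"MEF2"}}
toy_mouse = {"CNS1": {"SRF"}, "CNS2": {"GATA4"}}
toy_conserved = sites_conserved_in_both(toy_human, toy_mouse)
```

Option 3 would use the same filter, only with the mouse predictions collected from a larger region.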
57ModuleMiner performance TRMs and TRGMs
58Comparison to other CRM detection algorithms -
results
59Comparison to other CRM detection algorithms -
results
- Using TFBS set 2 in ranking step improves
performance of other methods
60Comparison to other CRM detection algorithms -
results
- TFBS set 2 does not always do best
61Application of ModuleMiner to microarray clusters
62Application of ModuleMiner to embryonic
development sets

Embryonic development process   TFBS set   Nr target genes after leave-one-out cross-validation (p-value)   AUC
Primary heart field [44]        1          6 / 7 (p = 6.4 × 10^-6)                                          0.92
Secondary heart field [44]      1          6 / 9 (p = 6.4 × 10^-5)                                          0.79
Neural crest cells [45]         2          6 / 10 (p = 1.5 × 10^-4)                                         0.86
Eye development [46]            1          10 / 15 (p = 1.9 × 10^-7)                                        0.79
Limb development [47]           1          10 / 24 (p = 5.2 × 10^-5)                                        0.77
63Where are the CRM predictions located?
TFBS set 1 and 2 TFBS set 3
Microarray clusters Development sets
64How are genes ranked?
- Vector-based data source
  - Microarray data
    - Candidate gene with expression similar to that of genes known for the disease gets a high score
  - Literature
  - Motifs
- Attribute-based data source
  - Gene Ontology
  - InterPro protein domains
  - KEGG pathways
  - EST anatomical expression
[Figure: known disease genes compared with low score and high score candidates. Example attributes: expression in brain, liver, kidney, ...; cytoskeleton (GO:0005856)]
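For an attribute-based data source, one way to decide whether an annotation such as cytoskeleton (GO:0005856) characterises the known disease genes is an over-representation test. The slides do not say which test is used, so the hypergeometric tail below is an assumption chosen for illustration, and the function name `enrichment_p_value` is hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, genome_hits, set_size, set_hits):
    """Probability of seeing at least set_hits annotated genes in a random
    gene set of set_size, when genome_hits of genome_size genes carry the
    annotation (upper tail of the hypergeometric distribution)."""
    total = comb(genome_size, set_size)
    upper = min(set_size, genome_hits)
    tail = sum(
        comb(genome_hits, k) * comb(genome_size - genome_hits, set_size - k)
        for k in range(set_hits, upper + 1)
    )
    return tail / total
```

Attributes with a small p-value in the training genes could then be used to score candidates that carry the same attribute.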
65Order statistics
- Given a set of n ordered rank ratios for gene i
  - (9/100 4/120 30/150 30/50 2/10 80/80)
  - → (0.09 0.03 0.2 0.6 0.2 1)
  - → (0.03 0.09 0.2 0.2 0.3 0.5 0.6 0.8)
- What is the probability of getting these rank ratios or better by chance alone?
- How many rank vectors does my vector strictly dominate?
- Joint probability density function of all n order statistics
- Recursive formula of complexity O(n^2)
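The question "what is the probability of getting these rank ratios or better by chance alone" can be answered with a quadratic-time recursion. The sketch below uses a recursion commonly given for this Q statistic, V_0 = 1, V_k = Σ_{i=1..k} (-1)^(i-1) · V_(k-i) / i! · r_(n-k+1)^i, Q = n! · V_n; that this is exactly the formula on the slide is an assumption, and the function name `q_statistic` is hypothetical.

```python
from math import factorial

def q_statistic(rank_ratios):
    """Probability that n uniform random rank ratios, once sorted, are all at
    least as good (as small) as the given ones, position by position.

    Runs in O(n^2) using the recursion
        V_0 = 1,  V_k = sum_{i=1..k} (-1)^(i-1) * V_(k-i) / i! * r_(n-k+1)^i
    and returns n! * V_n.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0
    for k in range(1, n + 1):
        r_k = r[n - k]  # r_(n-k+1) in 1-based indexing
        total = 0.0
        for i in range(1, k + 1):
            total += (-1) ** (i - 1) * v[k - i] / factorial(i) * r_k ** i
        v.append(total)
    return factorial(n) * v[n]
```

As a sanity check, a single rank ratio r gives Q = r, and two rank ratios r1 ≤ r2 give Q = 2·r1·r2 - r1², which matches the direct integral over the two order statistics.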
66Validation of the literature data source
Prioritizations of 199 random genes and the indicated disease gene
67Validation of genes related to complex diseases
Prioritizations of 199 random genes and the indicated disease gene
68Disease case study: DiGeorge syndrome
Atypical 22q11 deletion: 58 candidate genes
69Ypel1 as a novel DGS gene: validation in zebrafish
70A screen for genes involved in congenital heart
defects (CHD): work in progress
- Array-CGH of CHD patients with a chromosomal phenotype
- Map (micro)deletions and (micro)duplications
  - Known CHD gene explains phenotype
  - No known CHD gene in deleted/duplicated region(s)
    - Endeavour prioritization
    - Validation in zebrafish: morpholino knockdown, in situ hybridisation
[Figure: array-CGH profile of chromosome 14, vertical axis from -1.0 to 1.0]
71A screen for genes involved in congenital heart
defects (CHD): work in progress
- Same workflow, with counts: 100 and 16 (array-CGH and mapping steps), 7 (known CHD gene explains phenotype), 9 (no known CHD gene in deleted/duplicated region(s))
72CHD gene prioritization: optimizing the
performance
- Extra data source: microarray data of embryonic heart development (mouse)
- Multiple training sets: primary heart field, secondary heart field, CHD genes, neural crest cells, vascularization
- Validation/optimization of each training set by leave-one-out cross-validation
- Combine prioritizations using different training sets into one prioritization
- Performance gain
73CHD gene prioritization: preliminary results (in
situ hybridisation)
[Figure: array-CGH profiles of chromosome 14 and chromosome 4, vertical axes from -1.0 to 1.0]
74Bias to well characterized genes
75Endeavour
http://www.esat.kuleuven.be/endeavour
76Differential expression: histogram of p-values
77Is the spleen sample aberrant?
78Quantitative RT-PCR validation
- Fold change THRLBCL vs NLPHL
- Genes selected for involvement in
  - Interferon pathways
  - Macrophage activation
  - Innate immune responses

Gene symbol    Description                                           Fold difference microarray   Fold difference quantitative RT-PCR (p-value¹)
IFN-γ          Interferon gamma                                      4.7²                         4.4 (p = 1.0 × 10^-5)
STAT-1         Signal transducer and activator of transcription 1    1.6                          2.9 (p = 4.4 × 10^-9)
CD74           HLA class II histocompatibility antigen gamma chain   2.8                          1.2 (p = 0.21)
CCL8 (MCP-2)   Monocyte chemotactic protein 2                        143.5                        84.8 (p = 3.9 × 10^-9)
IDO            Indoleamine 2,3-dioxygenase                           9.0                          12.3 (p = 1.6 × 10^-8)
IFN-α1         Interferon alpha 1                                    1.0²                         0.92 (p = 0.81)
IFN-αR2        Interferon alpha receptor 2                           0.9²                         1.3 (p = 5.3 × 10^-3)
STAT-2         Signal transducer and activator of transcription 2    1.8²                         1.3 (p = 0.11)
TLR8           Toll-like receptor 8                                  11.5                         11.5 (p = 6.4 × 10^-11)
MyD88          Myeloid differentiation primary response gene (88)    1.8²                         2.2 (p = 6.7 × 10^-7)

¹ T-test, not corrected for multiple testing.
² Difference was not significant at p < 0.001 (after correction for multiple testing).
79Sensitivity of the classifier to the choice of
reference samples