Title: Computational Biology and Gene Expression
1MedGen 505: Gene Regulation Bioinformatics
Wyeth W. Wasserman
www.cisreg.ca
2Overview
- TFBS Prediction with Motif Models
- Improving Specificity of Predictions
- Analysis of Sets of Co-Expressed and Co-Regulated
Genes
3Transcription Factor Binding Sites (over-simplified for pedagogical purposes)
URF
Pol-II
TATA
URE
4Teaching a computer to find TFBS
5Laboratory Discovery of TFBS
[Figure: ACTIVITY of a series of LUCIFERASE reporter constructs]
6Representing Binding Sites for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
- A set of sites represented as a consensus
- VDRTWRWWSHD (IUPAC degenerate DNA)
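A minimal sketch of how a set of aligned sites can be summarised, assuming the sites are already aligned and of equal length. The function names `position_frequency_matrix` and `strict_consensus` are hypothetical, and the consensus here simply covers every base observed in a column, with no frequency cutoff, so it need not match a consensus derived under a different rule.

```python
from collections import Counter

# Standard IUPAC degenerate nucleotide codes, keyed by the set of bases covered.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def position_frequency_matrix(sites):
    """Count each base at each column of equal-length, pre-aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(s[i] for s in sites) for i in range(width)]
    return {b: [col[b] for col in columns] for b in "ACGT"}

def strict_consensus(sites):
    """IUPAC code covering every base observed in each column (no cutoff)."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*sites))

# Three of the sites listed on this slide, used only as a small demonstration.
demo_sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA"]
pfm = position_frequency_matrix(demo_sites)
consensus = strict_consensus(demo_sites)
```

The count matrix keeps the per-position information that the consensus string discards, which is why the following slides work with matrices.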
7PFMs to PWMs
Add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic
f matrix
A   5    0    1    0    0
C   0    2    2    4    0
G   0    3    1    0    4
T   0    0    1    1    1
w matrix
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2
$$w(b,i) = \log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$$
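The conversion from the f matrix to the w matrix can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact recipe behind the slide: the pseudocount for base b is taken as sqrt(N) * p(b), where N is the column depth, the corrected count is normalised by N + sqrt(N), the background is uniform (p(b) = 0.25), and the logarithm is base 2. The function name `pfm_to_pwm` is hypothetical. With these choices the f matrix on this slide gives the w matrix on this slide to one decimal place.

```python
import math

def pfm_to_pwm(pfm, background=None):
    """Convert a count matrix {base: [counts per column]} to log2-odds weights."""
    bases = "ACGT"
    # Assumption: uniform background unless one is supplied.
    background = background or {b: 0.25 for b in bases}
    width = len(pfm["A"])
    pwm = {b: [] for b in bases}
    for i in range(width):
        n = sum(pfm[b][i] for b in bases)          # depth of column i (must be > 0)
        for b in bases:
            pseudo = math.sqrt(n) * background[b]  # assumed pseudocount s(n)
            prob = (pfm[b][i] + pseudo) / (n + math.sqrt(n))
            pwm[b].append(math.log2(prob / background[b]))
    return pwm

f_matrix = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
w_matrix = pfm_to_pwm(f_matrix)
```

Working on a log scale means the score of a candidate site is simply the sum of one weight per position.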
8Performance of Profiles
- 95% of predicted sites bound in vitro (Tronche 1997)
- MyoD binding sites predicted about once every 600 bp (Fickett 1995)
- The Futility Conjecture
  - Nearly 100% of predicted transcription factor binding sites have no function in vivo
9JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES
10PROBLEM: Too many spurious predictions
Actin, alpha cardiac
11Terms
I.9
- Specificity: the portion of predictions that are correct
- Sensitivity: the portion of positives that are detected
- The detection of TFBS is limited by terrible specificity. Why?
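The two terms, as defined on this slide, reduce to simple ratios. The sketch below uses hypothetical function names and made-up counts purely for illustration.

```python
def sensitivity(true_positives, false_negatives):
    """Portion of real sites (positives) that are detected."""
    return true_positives / (true_positives + false_negatives)

def portion_of_predictions_correct(true_positives, false_positives):
    """'Specificity' in this tutorial's sense: portion of predictions that are correct."""
    return true_positives / (true_positives + false_positives)

# Illustrative numbers only: 40 real sites, 30 of them found among 1000 predictions.
sens = sensitivity(true_positives=30, false_negatives=10)
spec = portion_of_predictions_correct(true_positives=30, false_positives=970)
```

In this made-up case most real sites are found, yet only a small fraction of predictions are real, which is the situation the last bullet describes.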
12Method 1: Phylogenetic Footprinting
- 70,000,000 years of evolution reveals most
regulatory regions
13Phylogenetic Footprinting
[Figure: FoxC2; axis scale 100, 80, 60, 40, 20, 0]
14Phylogenetic Footprinting to Identify Functional
Segments
[Figure: Identity plotted against 200 bp Window Start Position (human sequence)]
Actin gene compared between human and mouse with DPB.
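A plot of identity against window start position can be computed from a pairwise alignment roughly as follows. This is a sketch under assumptions: the two sequences are supplied already aligned (equal length, gaps as "-"), windows are counted in alignment columns rather than human sequence coordinates, and the function name `window_identity` is hypothetical. It does not reproduce the DPB program.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Return (window start, percent identity) for each window of the alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where the two sequences carry the same base, 0 for mismatches and gaps.
    matches = [
        int(x == y and x != "-")
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
    ]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, 100.0 * identical / window))
    return profile

# Tiny demonstration with a 4-column window instead of 200 bp.
demo = window_identity("ACGTACGT", "ACGTTCGA", window=4, step=4)
```

Windows whose identity exceeds a chosen cutoff would then be kept as candidate conserved segments; the cutoff itself is a user decision.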
15Phylogenetic Footprinting Dramatically Reduces
Spurious Hits
Actin, alpha cardiac
16Performance Human vs. Mouse
[Figure: SELECTIVITY and SENSITIVITY]
- Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with a 100-site set)
- 75-90% of defined sites detected with conservation filter, while only 11-16% of total predictions retained
17ConSite (www.cisreg.ca)
NEW Ortholog Sequence Retrieval Service
18Emerging Issues
- Multiple sequence comparisons
  - Incorporate phylogenetic trees
  - Visualization
- Analysis of closely related species
  - Phylogenetic shadowing
- Genome rearrangements
  - Inversion compatible alignment algorithm
- Higher order models of TFBS
19OnLine Resources for Phylogenetic Footprinting
I.18
- Linked to TFBS
  - ConSite
  - rVISTA
- Alignments
  - Blastz
  - Lagan
  - Avid
  - ORCA
- Visualization
  - Sockeye
  - Vista Browser
  - PipMaker
20Method 2: Discrimination of Regulatory Modules
- TFs do NOT act in isolation
21Layers of Complexity in Metazoan Transcription
22Diverse and non-uniform use of terms: Partial glossary for tutorial
[Figure: gene schematic labelled Distal Regulatory Region, Proximal Regulatory Region, Promoter Region, TFBS, TATA, TSS, EXON]
- Promoter: sufficient to support the initiation of transcription; orientation dependent; includes TSS
- Regulatory Regions
  - Proximal: adjacent to promoter
  - Distal: some distance away from promoter (vague)
  - May be positive (enhancing) or negative (repressing)
- TSS: transcription start site
- TFBS: single transcription factor binding site
- Modules: sets of TFBS that function together
23Detecting Clusters of TF Binding Sites
- Trained Methods
  - Sufficient examples of real clusters to establish weights on the relative importance of each TF
- Statistical Over-Representation of Combinations
  - Binding profiles available for a set of biologically motivated TFs
24Training for the detection of liver
cis-regulatory modules (CRMs)
25Models for Liver TFs
HNF3
HNF1
HNF4
C/EBP
26Logistic Regression Analysis
Optimize a vector (a_1, a_2, a_3, a_4) to maximize the distance between output values for positive and negative training data. Output value is
$$p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}, \qquad \mathrm{logit} = \sum_i a_i x_i$$
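The logistic output function can be sketched as below, assuming one input score per TF model (for example, one each for the four liver TFs) and one coefficient per score. The function name `module_probability`, the example scores and the example weights are all hypothetical; the fitting of the coefficients on training data is not shown.

```python
import math

def module_probability(scores, weights):
    """Logistic output: exp(logit) / (1 + exp(logit)), logit = sum of weight * score."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per score")
    logit = sum(a * x for a, x in zip(weights, scores))
    return math.exp(logit) / (1.0 + math.exp(logit))

# Hypothetical per-TF scores for one sequence window and hypothetical coefficients.
window_scores = [0.8, 0.1, 0.5, 0.0]
coefficients = [1.5, 2.0, 1.0, 0.5]
p = module_probability(window_scores, coefficients)
```

The output always lies between 0 and 1, so a single threshold on it separates predicted modules from background windows.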
27Performance of the Liver Model
- Performance
  - Sensitivity: 60% of known CRMs detected
  - Specificity: 1 prediction/35,000 bp
- Limitations
  - Applies to genes expressed late in hepatocyte differentiation
  - Requires 10-15 genes in positive training set
  - This model doesn't account for multiple sites for the same TF
    - New methods from several groups address this limit
28UGT1A1
[Figure: Liver Module Model Score plotted against Window Position in Sequence; series: Wildtype, Other]
29Making better predictions
- Profiles make far too many false predictions to have predictive value in isolation
- Phylogenetic footprinting eliminates 90% of false predictions
- Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context
30Method 3: Higher Order Models
- Position-position dependence
31Probabilistic Methods for Pattern Discovery(7)
What is a higher-order background model?
Zero-order: p(A) = 0.29, p(C) = 0.21, p(G) = 0.21, p(T) = 0.29
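To make the distinction concrete: a zero-order model scores each base independently with the frequencies above, while a first-order model conditions each base on the one before it. The sketch below is illustrative only; the function names, the add-one pseudocount and the toy training sequence are assumptions.

```python
import math
from collections import Counter

ZERO_ORDER = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}

def zero_order_log_prob(seq, p=ZERO_ORDER):
    """Log probability when every base is drawn independently."""
    return sum(math.log(p[b]) for b in seq)

def train_first_order(training_seq, pseudo=1.0):
    """Estimate p(next base | previous base) from dinucleotide counts."""
    pairs = Counter(zip(training_seq, training_seq[1:]))
    model = {}
    for prev in "ACGT":
        total = sum(pairs[(prev, nxt)] + pseudo for nxt in "ACGT")
        model[prev] = {nxt: (pairs[(prev, nxt)] + pseudo) / total for nxt in "ACGT"}
    return model

def first_order_log_prob(seq, model, p0=ZERO_ORDER):
    """Log probability under a first-order Markov chain; first base uses p0."""
    logp = math.log(p0[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(model[prev][nxt])
    return logp

# Toy training sequence, chosen only to show that neighbouring bases matter.
markov = train_first_order("ATATATAT")
```

Under the toy model "ATAT" is far more probable than the zero-order frequencies suggest, which is the kind of sequence property a higher-order background is meant to absorb.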
32Linking co-expressed genes to candidate
transcription factors
33Deciphering Regulation of Co-Expressed Genes
34oPOSSUM Procedure
35Statistical Methods for Identifying
Over-represented TFBS
- Z scores
  - Based on the number of occurrences of the TFBS relative to background
  - Normalized for sequence length
  - Simple binomial distribution model
- Fisher exact probability scores
  - Based on the number of genes containing the TFBS relative to background
  - Hypergeometric probability distribution
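The two measures can be sketched as below. The exact normalisation used by oPOSSUM is not given on the slide, so this is a generic illustration: a z-score from a binomial model of site occurrences, and a one-sided hypergeometric tail probability for the number of genes containing a site. The function names and parameter names are hypothetical; `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_z_score(hits, positions, background_rate):
    """Z-score of observed hits versus a binomial expectation.

    positions: number of positions scanned in the co-expressed set
    background_rate: per-position hit rate estimated from the background set
    """
    expected = positions * background_rate
    sd = math.sqrt(positions * background_rate * (1.0 - background_rate))
    return (hits - expected) / sd

def hypergeometric_tail(k, n, K, N):
    """P(X >= k): at least k of n selected genes contain the site,
    when K of the N background genes contain it."""
    upper = min(n, K)
    numerator = sum(
        math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return numerator / math.comb(N, n)
```

The z-score responds to the number of sites, the hypergeometric score to the number of genes carrying at least one site, so the two can rank the same TF differently.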
36The oPOSSUM Database
- Orthologous genes: 8468
- Promoter pairs: 6911
- Promoters with TFBS: 6758
- Total number of TFBS predictions: 1638293
- Overall failure rate: 20.2%
37Validation using Reference Gene Sets
A. Muscle-specific (23 input; 16 analyzed)
TF          Rank  Z-score  Fisher
SRF         1     21.41    1.18e-02
MEF2        2     18.12    8.05e-04
c-MYB_1     3     14.41    1.25e-03
Myf         4     13.54    3.83e-03
TEF-1       5     11.22    2.87e-03
deltaEF1    6     10.88    1.09e-02
S8          7     5.874    2.93e-01
Irf-1       8     5.245    2.63e-01
Thing1-E47  9     4.485    4.97e-02
HNF-1       10    3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)
TF          Rank  Z-score  Fisher
HNF-1       1     38.21    8.83e-08
HLF         2     11.00    9.50e-03
Sox-5       3     9.822    1.22e-01
FREAC-4     4     7.101    1.60e-01
HNF-3beta   5     4.494    4.66e-02
SOX17       6     4.229    4.20e-01
Yin-Yang    7     4.070    1.16e-01
S8          8     3.821    1.61e-02
Irf-1       9     3.477    1.69e-01
COUP-TF     10    3.286    2.97e-01
TFs with experimentally-verified sites in the
reference sets.
38Application to Microarray Data Sets
- NF-κB inhibition microarray study
39Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)
TF Class Rank Z-score Fisher No. Genes
p65 REL 1 36.57 5.66e-12 62
NF-kappaB REL 2 32.58 5.82e-11 61
c-REL REL 3 26.02 8.59e-08 63
Irf-2 TRP-CLUSTER 4 20.39 5.74e-04 6
SPI-B ETS 5 16.59 1.23e-03 135
Irf-1 TRP-CLUSTER 6 15.4 9.55e-04 23
Sox-5 HMG 7 15.38 2.56e-02 126
p50 REL 8 14.72 2.23e-03 19
Nkx HOMEO 9 13.66 2.29e-03 111
Bsap PAIRED 10 13.2 9.92e-02 1
FREAC-4 FORKHEAD 11 12.05 1.66e-03 92
n-MYC bHLH-ZIP 25 6.695 1.84e-03 102
ARNT bHLH 26 6.695 1.84e-03 102
HNF-3beta FORKHEAD 29 5.948 3.32e-03 47
SOX17 HMG 31 5.406 8.60e-03 79
40C-Myc SAGE Data
- c-Myc transcription factor dimerizes with the Max protein
- Key regulator of cell proliferation, differentiation and apoptosis
- Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells
- They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR
41Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)
TF Class Rank Z-score Fisher No. Genes
Myc-Max bHLH-ZIP 1 21.68 5.35e-03 7
Staf ZN-FINGER, C2H2 2 20.17 1.70e-02 2
Max bHLH-ZIP 3 18.32 2.16e-02 12
SAP-1 ETS 4 13.23 1.61e-04 13
USF bHLH-ZIP 5 11.90 1.84e-01 16
SP1 ZN-FINGER, C2H2 6 11.68 4.40e-02 12
n-MYC bHLH-ZIP 7 11.11 1.55e-01 20
ARNT bHLH 8 11.11 1.55e-01 20
Elk-1 ETS 9 10.92 3.88e-03 19
Ahr-ARNT bHLH 10 10.17 1.11e-01 25
42C-Fos Microarray Experiment
- In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line
- We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs
43Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)
TF Class Rank Z-score Fisher No. Genes
c-FOS bZIP 1 17.53 2.60e-05 45
RREB-1 ZN-FINGER, C2H2 2 8.899 1.41e-01 1
PPARgamma-RXRal NUCLEAR RECEPTOR 3 3.991 2.98e-01 1
CREB bZIP 4 3.626 1.25e-01 10
E2F Unknown 5 2.965 7.67e-02 15
NF-kappaB REL 6 2.915 1.04e-01 17
SRF MADS 7 2.707 2.24e-01 2
MEF2 MADS 8 2.634 1.32e-01 13
c-REL REL 9 2.467 5.79e-02 22
Staf ZN-FINGER, C2H2 10 2.385 3.74e-01 1
Ahr-ARNT bHLH 15 1.716 2.57e-03 63
deltaEF1 ZN-FINGER, C2H2 23 0.271 5.39e-03 75
Elk-1 ETS 21 0.7875 8.12e-03 37
MZF_1-4 ZN-FINGER, C2H2 27 -0.2421 5.41e-03 73
n-MYC bHLH-ZIP 30 -0.8738 8.20e-03 51
ARNT bHLH 31 -0.8738 8.20e-03 51
44oPOSSUM Server
45- http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum
INPUT A LIST OF CO-EXPRESSED GENES
46SELECT YOUR TFBS PROFILES
47- SELECT
  - CONSERVATION
  - PSSM MATCH THRESHOLD
  - PROMOTER REGION
  - STATISTICAL MEASURE
48de novo Discovery of TF Binding Sites
49Pattern Discovery
50de novo Pattern Discovery
- Exhaustive
  - e.g. YMF (Sinha & Tompa)
  - Generalization: Identify over-represented oligomers in comparison of "+" and "-" (or complete) promoter collections
- Monte Carlo/Gibbs Sampling
  - e.g. AnnSpec (Workman & Stormo)
  - Generalization: Identify strong patterns in promoter collection vs. background model of expected sequence characteristics
51Exhaustive methods
Word based methods: How likely are X words in a set of sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 () CeIV msp-56 B range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 () Ce Cuticle Col-12 range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 () Ce Cuticle Col-13 range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 () Ce vitellogenin 2 range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 () Ce vitellogenin 5 range -100 to -75
CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4 range -100 to -75
ATCTATTTACAATGATAAAACTTCAA >EP11016 () Ce vitellogenin 6 range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 () Ce calmodulin cal-2 range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1 range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 () Ce cAMP-dep. PKR P2 range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 () Ce hsp 16K-1 A range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B range
52Exhaustive methods
Over-representation: How many words of type AGGAGTGA are found in our sequences?
How likely is this result?
53Exhaustive methods
Find all words of length 7 in the yeast genome
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC
GTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCA
CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTG
GAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACG
GTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAA
TCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGA
AGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGG
AAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTC
AAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAG
CTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGC
TCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTG
TTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAAT
TGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGT
ATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCC
ATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGC
CAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGA
TGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTA
TTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGT
TTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCA
GGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGA
CGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTG
GTGACACTGCCA
Make a lookup table:
AAACCTTT 456
TTTTTTTT 57788
GATAGGCA 589
Etc...
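A lookup table of word counts can be built in a single pass over the sequence. This is a minimal sketch with a hypothetical function name; it counts overlapping words on the given strand only and does not merge reverse complements.

```python
from collections import Counter

def count_words(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Small demonstration on the first bases of the sequence shown on this slide.
table = count_words("GTCTTTATCTTCAAAGTTGTCTGTCCAAGATT", 7)
most_common_words = table.most_common(3)
```

Counts from a promoter set can then be compared with counts from the whole genome to ask which words occur more often than expected.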
54Probabilistic Methods for Pattern Discovery
The Gibbs Sampling algorithm
tgacttcc
tgatctct
agacctca
tgacctct
Two data structures used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site startpoints in the N sequences a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}.
One starting point in each sequence is chosen randomly initially.
55Probabilistic Methods for Pattern Discovery
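Iteration step: Remove one sequence z from the set. Update the current pattern according to
[Equation not transcribed; its annotations read "Pseudocount for symbol j" and "Sum of all pseudocounts in column"]
[Figure: aligned sites tgacttcc, tgatctct, agacctca, tgacctct, with sequence z held out]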
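The iteration can be sketched as follows, in the style of the site sampler of Lawrence et al. (1993). Assumptions, since the slide's equation did not survive: pattern frequencies are computed as (count + pseudocount for the symbol) / (N - 1 + sum of pseudocounts), the background is uniform, each sequence contains exactly one site, and a new start in the held-out sequence is drawn with probability proportional to the pattern-to-background likelihood ratio. All function and parameter names are hypothetical.

```python
import random

BASES = "ACGT"

def pattern_frequencies(seqs, starts, width, skip, pseudo=0.25):
    """q[i][b]: frequency of base b in motif column i, leaving out sequence `skip`."""
    n_sites = len(seqs) - 1                 # one sequence is held out
    pseudo_total = pseudo * len(BASES)      # sum of all pseudocounts in a column
    q = []
    for i in range(width):
        counts = {b: pseudo for b in BASES}
        for k, (seq, start) in enumerate(zip(seqs, starts)):
            if k != skip:
                counts[seq[start + i]] += 1
        q.append({b: counts[b] / (n_sites + pseudo_total) for b in BASES})
    return q

def gibbs_sample(seqs, width, iterations=200, seed=0):
    """Return one sampled start position per sequence (sequences must be A/C/G/T)."""
    rng = random.Random(seed)
    seqs = [s.upper() for s in seqs]
    background = {b: 0.25 for b in BASES}   # assumed uniform background
    # Initialisation: one random starting point in each sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):
            # Update the pattern without sequence z.
            q = pattern_frequencies(seqs, starts, width, skip=z)
            # Weight every candidate start in z by its likelihood ratio.
            weights = []
            for a in range(len(seqs[z]) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seqs[z][a + i]
                    ratio *= q[i][base] / background[base]
                weights.append(ratio)
            # Sample the new start for z in proportion to the weights.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because positions are sampled rather than chosen greedily, repeated runs with different seeds can settle on different alignments, so in practice several runs are compared.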
56Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
57Four Approaches to Improve Sensitivity
- Better background models
  - Higher-order properties of DNA
- Phylogenetic Footprinting
  - Human-mouse comparison eliminates 75% of sequence
- Regulatory Modules
  - Architectural rules
- Limit the types of binding profiles allowed
  - TFBS patterns are NOT random
58Enhancing pattern detection sensitivity
TFBSs are not randomly drawn
- Information segmentation: information content distributions of TFBS are distinctly non-random (Wasserman et al 2000)
- Palindromicity, dyads (van Helden et al 2000)
- Variable gaps (Hu 2003)
59Pattern discovery methods using biochemical
constraints
61Our Hypothesis
- Point 1: Structurally-related DNA binding domains interact with similar target sequences
  - Exceptions exist (e.g. Zn-fingers)
- Point 2: There are a finite number of binding domains used in human TFs
  - Approximately 20-25
- Idea: We could use the shared binding properties for each family to focus pattern detection methods
  - Constrain the range of patterns sought
62Comparison of profiles requires alignment and a
scoring function
- Scoring function based on sum of squared differences
- Align frequency matrices with modified Needleman-Wunsch algorithm
- Calculate empirical p-values based on simulated set of matrices
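The first two steps can be sketched as below. This is an illustration, not the authors' implementation: the column score is assumed to be 2 minus the sum of squared differences between two frequency columns (so identical columns score 2 and fully different columns score 0), the gap penalty is a made-up value, and a plain global Needleman-Wunsch recursion stands in for the modified algorithm. The empirical p-value step is not shown.

```python
def column_score(col_a, col_b):
    """2 minus the sum of squared differences between two frequency columns."""
    return 2.0 - sum((a - b) ** 2 for a, b in zip(col_a, col_b))

def align_matrices(m1, m2, gap=-1.0):
    """Global alignment score of two matrices, each a list of [A, C, G, T] columns."""
    rows, cols = len(m1) + 1, len(m2) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + column_score(m1[i - 1], m2[j - 1]),  # align columns
                dp[i - 1][j] + gap,                                     # gap in m2
                dp[i][j - 1] + gap,                                     # gap in m1
            )
    return dp[-1][-1]

# Hypothetical three-column frequency profile compared with itself.
profile = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0], [0.25, 0.25, 0.25, 0.25]]
self_score = align_matrices(profile, profile)
```

Scores from many simulated matrices of the same width would give the reference distribution needed to turn a raw score into an empirical p-value.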
63Intra-family comparisons more similar than
inter-family
TF Database (JASPAR)
COMPARE
Jackknife test: 87% correct
Independent test set: 93% correct
65FBPs enhance sensitivity of pattern detection
67REVIEWING THE TOP POINTS
68Orientation: Regulatory regions problem space
Sets of binding sites:
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG
AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG
AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC
GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
Specificity profiles for binding sites:
A  -2      0      -2      -0.415   0.585  -2      -2       2.088  -2      -2      -1       0.585
C   1      0.585   0       0      -1      -2      -2      -2       2.088  -2       0.585   0.807
G   0.585  0.322   0.807   1.585   1      -2       2      -2      -2       2.088  -2       0
T   0.319  0.322   1      -2       0       2.088  -1      -2      -2      -2       1.459  -0.415
Clusters of binding sites
Transcription factors Transcription factor
binding sites Regulatory nucleotide sequences
69Analysis of regulatory regions with TFBS
Detecting binding sites in a single sequence
Scanning a sequence against a PWM
Sp1
Abs_score 13.4 (sum of column scores)
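Scanning a sequence against a PWM amounts to sliding a window along the sequence and summing one column score per position. The sketch below is illustrative: the function name `scan`, the threshold and the test sequence are assumptions, the matrix is the five-column w matrix from the PFMs to PWMs slide rather than an Sp1 profile, and only the forward strand is scanned.

```python
def scan(sequence, pwm, threshold):
    """Return (start, score) for every window whose summed column scores reach threshold."""
    sequence = sequence.upper()
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Absolute score: sum of the weight of the observed base in each column.
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 3)))
    return hits

w_matrix = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}
hits = scan("TTAGCCGAA", w_matrix, threshold=5.0)
```

Lowering the threshold raises sensitivity but quickly multiplies the number of hits, which is the specificity problem that conservation filtering addresses.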
70Analysis of regulatory regions with TFBS
Phylogenetic Footprints
Scanning a single sequence
- Low specificity of profiles
  - too many hits
  - great majority not biologically significant
Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions
- A dramatic improvement in the percentage of biologically significant detections
71Pattern Discovery
72Concluding Thoughts
- Bioinformatics is often constrained by our understanding of biochemistry rather than computational or statistical limitations
- Evolution has a powerful influence on the performance of many bioinformatics methods
- Computational predictions have value, but only if you understand the limitations of the methods