Computational Biology and Gene Expression

1
MedGen 505: Gene Regulation Bioinformatics
Wyeth W. Wasserman
www.cisreg.ca
2
Overview
  • TFBS Prediction with Motif Models
  • Improving Specificity of Predictions
  • Analysis of Sets of Co-Expressed and Co-Regulated
    Genes

3
Transcription Factor Binding Sites (over-simplified for pedagogical purposes)
[Figure labels: URF, Pol-II, TATA, URE]
4
Teaching a computer to find TFBS
5
Laboratory Discovery of TFBS
[Figure: constructs labeled LUCIFERASE, with measured ACTIVITY]
6
Representing Binding Sites for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
  • A single site
  • AAGTTAATGA
  • A set of sites represented as a consensus
  • VDRTWRWWSHD (IUPAC degenerate DNA)
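
As a minimal sketch of how a set of aligned sites can be tallied into a position frequency matrix (the "f matrix" idea), the following Python counts each base at each position. The function name `build_pfm` is an assumption made for illustration, and only the first five sites from the set above are used.

```python
def build_pfm(sites):
    """Count A/C/G/T at each position of equal-length, aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm

# Illustrative subset: the first five sites listed on the slide.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA", "GAGTTAATAA"]
pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the result sums to the number of sites, which is the "depth" that later weighting steps rely on.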

7
PFMs to PWMs
Add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

f matrix
A   5   0   1   0   0
C   0   2   2   4   0
G   0   3   1   0   4
T   0   0   1   1   1

w matrix
A   1.6  -1.7  -0.2  -1.7  -1.7
C  -1.7   0.5   0.5   1.3  -1.7
G  -1.7   1.0  -0.2  -1.7   1.3
T  -1.7  -1.7  -0.2  -0.2  -0.2

w(b,i) = \log \left( \frac{f(b,i) + s(n)}{p(b)} \right)
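
The conversion can be sketched in Python under some stated assumptions. The slide does not give the pseudocount function s(n), the background p(b), or the log base. The sketch below assumes a total pseudocount of sqrt(N) per column split equally over the four bases, a uniform background of 0.25, and log base 2. These choices are assumptions, picked because they reproduce the w matrix values above to one decimal place. The name `pfm_to_pwm` is hypothetical.

```python
import math

def pfm_to_pwm(pfm, background=0.25):
    """Convert counts to log2 odds; pseudocount sqrt(N)/4 per base (assumed)."""
    bases = "ACGT"
    width = len(pfm["A"])
    pwm = {b: [] for b in bases}
    for i in range(width):
        n = sum(pfm[b][i] for b in bases)   # column depth N
        pseudo_total = math.sqrt(n)         # assumed total pseudocount
        for b in bases:
            corrected = (pfm[b][i] + pseudo_total / 4) / (n + pseudo_total)
            pwm[b].append(math.log2(corrected / background))
    return pwm

# The f matrix from the slide.
f_matrix = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
w_matrix = pfm_to_pwm(f_matrix)
rounded = {b: [round(v, 1) for v in w_matrix[b]] for b in "ACGT"}
for b in "ACGT":
    print(b, rounded[b])
```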
8
Performance of Profiles
  • 95% of predicted sites bound in vitro (Tronche
    1997)
  • MyoD binding sites predicted about once every 600
    bp (Fickett 1995)
  • The Futility Conjecture: nearly 100% of predicted
    transcription factor binding sites have no
    function in vivo

9
JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING
PROFILES
10
PROBLEM: Too many spurious predictions
Actin, alpha cardiac
11
Terms
  • Specificity: the proportion of predictions that are
    correct
  • Sensitivity: the proportion of positives that are
    detected
  • The detection of TFBS is limited by terrible
    specificity. Why?

12
Method 1: Phylogenetic Footprinting
  • 70,000,000 years of evolution reveals most
    regulatory regions

13
Phylogenetic Footprinting
[Figure: FoxC2; axis ticks from 100 to 0]
14
Phylogenetic Footprinting to Identify Functional Segments
[Figure axes: Identity vs. 200 bp Window Start Position (human sequence)]
Actin gene compared between human and mouse with DPB.
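
The kind of plot described here can be approximated with a sliding-window identity calculation. The sketch below assumes the two sequences have already been aligned (gaps written as "-"); it is not the DPB program named on the slide, and the function name `window_identity` and the `step` parameter are hypothetical. The 200-column default follows the axis label.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Percent identity in each window of a pairwise alignment.

    Start positions are alignment columns, not coordinates in either
    original sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        a = aligned_a[start:start + window]
        b = aligned_b[start:start + window]
        # A gap aligned to a gap does not count as a match.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results

# Toy example with a 4-column window so the numbers are easy to check.
toy = window_identity("ACGTACGT", "ACGTTTTT", window=4, step=4)
print(toy)
```

Windows above a chosen identity cutoff would then be treated as conserved segments worth scanning for binding sites.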
15
Phylogenetic Footprinting Dramatically Reduces
Spurious Hits
Actin, alpha cardiac
16
Performance: Human vs. Mouse
[Figure panels: SELECTIVITY, SENSITIVITY]
  • Testing set: 40 experimentally defined sites in
    15 well-studied genes (replicated with a 100-site
    set)
  • 75-90% of defined sites detected with the
    conservation filter, while only 11-16% of total
    predictions retained

17
ConSite (www.cisreg.ca)
NEW Ortholog Sequence Retrieval Service
18
Emerging Issues
  • Multiple sequence comparisons
      • Incorporate phylogenetic trees
      • Visualization
  • Analysis of closely related species
      • Phylogenetic shadowing
  • Genome rearrangements
      • Inversion compatible alignment algorithm
  • Higher order models of TFBS

19
Online Resources for Phylogenetic Footprinting
  • Linked to TFBS
      • ConSite
      • rVISTA
  • Alignments
      • Blastz
      • Lagan
      • Avid
      • ORCA
  • Visualization
      • Sockeye
      • Vista Browser
      • PipMaker

20
Method 2: Discrimination of Regulatory Modules
  • TFs do NOT act in isolation

21
Layers of Complexity in Metazoan Transcription
22
Diverse and non-uniform use of terms: partial
glossary for tutorial
[Figure labels: Distal Regulatory Region, Proximal Regulatory Region, Promoter Region, TFBS, TATA, TSS, EXON]
  • Promoter: sufficient to support the initiation
    of transcription; orientation dependent; includes
    TSS
  • Regulatory Regions
      • Proximal: adjacent to promoter
      • Distal: some distance away from promoter (vague)
      • May be positive (enhancing) or negative
        (repressing)
  • TSS: transcription start site
  • TFBS: single transcription factor binding site
  • Modules: sets of TFBS that function together

23
Detecting Clusters of TF Binding Sites
  • Trained Methods
      • Sufficient examples of real clusters to establish
        weights on the relative importance of each TF
  • Statistical Over-Representation of Combinations
      • Binding profiles available for a set of
        biologically motivated TFs

24
Training for the detection of liver
cis-regulatory modules (CRMs)
25
Models for Liver TFs
HNF3
HNF1
HNF4
C/EBP
26
Logistic Regression Analysis
Weight vector: a1 a2 a3 a4
Optimize a vector to maximize the distance
between output values for positive and negative
training data. Output value is

p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}

\mathrm{logit} = \sum_i a_i x_i
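
A minimal Python sketch of the output value follows. It assumes the logit is a plain weighted sum of four per-TF scores with no intercept term; the function name `crm_probability`, the example weights, and the example scores are all hypothetical and are not the trained liver model.

```python
import math

def crm_probability(scores, weights):
    """Logistic output p(x) = e^logit / (1 + e^logit), logit = sum(a_i * x_i)."""
    logit = sum(a * x for a, x in zip(weights, scores))
    # Algebraically identical to exp(logit) / (1 + exp(logit)).
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical weights a1..a4 and hypothetical best-hit scores for four TFs.
weights = [0.8, 0.5, 0.6, 0.3]
weak_window = [0.0, 0.0, 0.0, 0.0]
strong_window = [2.0, 1.5, 1.0, 2.5]
print(crm_probability(weak_window, weights))
print(crm_probability(strong_window, weights))
```

Training would adjust the weights so that windows from known modules score near 1 and background windows score near 0.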
27
Performance of the Liver Model
  • Performance
      • Sensitivity: 60% of known CRMs detected
      • Specificity: 1 prediction per 35,000 bp
  • Limitations
      • Applies to genes expressed late in hepatocyte
        differentiation
      • Requires 10-15 genes in positive training set
      • This model doesn't account for multiple sites for
        the same TF
      • New methods from several groups address this limit

28
[Figure: UGT1A1; Liver Module Model Score vs. Window Position in Sequence; series: Wildtype, Other]
29
Making better predictions
  • Profiles make far too many false predictions to
    have predictive value in isolation
  • Phylogenetic footprinting eliminates 90% of
    false predictions
  • Algorithms for detection of clusters of binding
    sites perform better, especially when it is
    possible to train on known examples for the
    target context

30
Method 3: Higher Order Models
  • Position-position dependence

31
Probabilistic Methods for Pattern Discovery (7)
What is a higher-order background model?
Zero-order: p(A) = 0.29, p(C) = 0.21, p(G) = 0.21, p(T) = 0.29
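
To make the contrast concrete, the sketch below scores a sequence under the zero-order frequencies given on the slide and under a first-order model in which each base depends on the one before it. The first-order estimator, its add-one smoothing, the toy training sequence, and all function names are assumptions made for illustration.

```python
import math

ZERO_ORDER = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}  # from the slide

def zero_order_log_prob(seq, freqs=ZERO_ORDER):
    """Each base is independent of its neighbours."""
    return sum(math.log(freqs[b]) for b in seq)

def first_order_model(training_seq):
    """Estimate p(next | previous) from dinucleotide counts (add-one smoothing)."""
    counts = {a: {b: 1 for b in "ACGT"} for a in "ACGT"}
    for prev, nxt in zip(training_seq, training_seq[1:]):
        counts[prev][nxt] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in "ACGT"}
            for a in "ACGT"}

def first_order_log_prob(seq, transitions, freqs=ZERO_ORDER):
    """First base uses zero-order frequencies, the rest use transitions."""
    logp = math.log(freqs[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(transitions[prev][nxt])
    return logp

transitions = first_order_model("ATATATAT")  # toy training sequence
print(zero_order_log_prob("ATAT"))
print(first_order_log_prob("ATAT", transitions))
```

A background trained on AT-repeat-rich sequence assigns "ATAT" a higher probability than the zero-order model does, so such words look less surprising when they turn up in promoters.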
32
Linking co-expressed genes to candidate
transcription factors
33
Deciphering Regulation of Co-Expressed Genes
34
oPOSSUM Procedure
35
Statistical Methods for Identifying
Over-represented TFBS
  • Z scores
      • Based on the number of occurrences of the TFBS
        relative to background
      • Normalized for sequence length
      • Simple binomial distribution model
  • Fisher exact probability scores
      • Based on the number of genes containing the TFBS
        relative to background
      • Hypergeometric probability distribution
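
The two statistics can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not oPOSSUM's actual implementation: the z-score uses a plain binomial mean and standard deviation with no continuity correction, the Fisher score is a one-tailed hypergeometric sum, and the function names and example numbers are hypothetical.

```python
import math

def binomial_z_score(hits, n_positions, background_rate):
    """Z-score of observed site occurrences versus a binomial expectation."""
    expected = n_positions * background_rate
    sd = math.sqrt(n_positions * background_rate * (1 - background_rate))
    return (hits - expected) / sd

def fisher_one_tailed(k, n, K, N):
    """P(X >= k): k of n target genes contain the TFBS, while K of the
    N background genes (target set included) contain it."""
    total = math.comb(N, n)
    tail = sum(math.comb(K, x) * math.comb(N - K, n - x)
               for x in range(k, min(n, K) + 1))
    return tail / total

# Hypothetical numbers for illustration only.
print(binomial_z_score(hits=20, n_positions=100, background_rate=0.1))
print(fisher_one_tailed(k=8, n=20, K=50, N=1000))
```

The z-score counts every occurrence, so one gene with many sites can dominate it, while the Fisher score counts genes and is insensitive to repeated sites within a gene.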

36
The oPOSSUM Database
  • Orthologous genes: 8468
  • Promoter pairs: 6911
  • Promoters with TFBS: 6758
  • Total number of TFBS predictions: 1638293
  • Overall failure rate: 20.2%

37
Validation using Reference Gene Sets
A. Muscle-specific (23 input; 16 analyzed)
TF          Rank  Z-score  Fisher
SRF         1     21.41    1.18e-02
MEF2        2     18.12    8.05e-04
c-MYB_1     3     14.41    1.25e-03
Myf         4     13.54    3.83e-03
TEF-1       5     11.22    2.87e-03
deltaEF1    6     10.88    1.09e-02
S8          7     5.874    2.93e-01
Irf-1       8     5.245    2.63e-01
Thing1-E47  9     4.485    4.97e-02
HNF-1       10    3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)
TF          Rank  Z-score  Fisher
HNF-1       1     38.21    8.83e-08
HLF         2     11.00    9.50e-03
Sox-5       3     9.822    1.22e-01
FREAC-4     4     7.101    1.60e-01
HNF-3beta   5     4.494    4.66e-02
SOX17       6     4.229    4.20e-01
Yin-Yang    7     4.070    1.16e-01
S8          8     3.821    1.61e-02
Irf-1       9     3.477    1.69e-01
COUP-TF     10    3.286    2.97e-01

TFs with experimentally-verified sites in the
reference sets.
38
Application to Microarray Data Sets
  • NF-κB inhibition microarray study

39
Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)
TF          Class         Rank  Z-score  Fisher    No. Genes
p65         REL           1     36.57    5.66e-12  62
NF-kappaB   REL           2     32.58    5.82e-11  61
c-REL       REL           3     26.02    8.59e-08  63
Irf-2       TRP-CLUSTER   4     20.39    5.74e-04  6
SPI-B       ETS           5     16.59    1.23e-03  135
Irf-1       TRP-CLUSTER   6     15.4     9.55e-04  23
Sox-5       HMG           7     15.38    2.56e-02  126
p50         REL           8     14.72    2.23e-03  19
Nkx         HOMEO         9     13.66    2.29e-03  111
Bsap        PAIRED        10    13.2     9.92e-02  1
FREAC-4     FORKHEAD      11    12.05    1.66e-03  92
n-MYC       bHLH-ZIP      25    6.695    1.84e-03  102
ARNT        bHLH          26    6.695    1.84e-03  102
HNF-3beta   FORKHEAD      29    5.948    3.32e-03  47
SOX17       HMG           31    5.406    8.60e-03  79
40
C-Myc SAGE Data
  • c-Myc transcription factor dimerizes with the Max
    protein
  • Key regulator of cell proliferation,
    differentiation and apoptosis
  • Menssen and Hermeking identified 216 different
    SAGE tags corresponding to unique mRNAs that were
    induced after adenoviral expression of c-Myc in
    HUVEC cells
  • They then went on to confirm the induction of 53
    genes using microarray analysis and RT-PCR

41
Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)
TF        Class            Rank  Z-score  Fisher    No. Genes
Myc-Max   bHLH-ZIP         1     21.68    5.35e-03  7
Staf      ZN-FINGER, C2H2  2     20.17    1.70e-02  2
Max       bHLH-ZIP         3     18.32    2.16e-02  12
SAP-1     ETS              4     13.23    1.61e-04  13
USF       bHLH-ZIP         5     11.90    1.84e-01  16
SP1       ZN-FINGER, C2H2  6     11.68    4.40e-02  12
n-MYC     bHLH-ZIP         7     11.11    1.55e-01  20
ARNT      bHLH             8     11.11    1.55e-01  20
Elk-1     ETS              9     10.92    3.88e-03  19
Ahr-ARNT  bHLH             10    10.17    1.11e-01  25
42
C-Fos Microarray Experiment
  • In a study examining the role of transcriptional
    repression in oncogenesis, Ordway et al. compared
    the gene expression profiles of fibroblasts
    transformed by c-fos to the parental 208F rat
    fibroblast cell line
  • We mapped the list of 252 induced Affymetrix Rat
    Genome U34A GeneChip sequences to 136 human
    orthologs

43
Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)
TF               Class             Rank  Z-score  Fisher    No. Genes
c-FOS            bZIP              1     17.53    2.60e-05  45
RREB-1           ZN-FINGER, C2H2   2     8.899    1.41e-01  1
PPARgamma-RXRal  NUCLEAR RECEPTOR  3     3.991    2.98e-01  1
CREB             bZIP              4     3.626    1.25e-01  10
E2F              Unknown           5     2.965    7.67e-02  15
NF-kappaB        REL               6     2.915    1.04e-01  17
SRF              MADS              7     2.707    2.24e-01  2
MEF2             MADS              8     2.634    1.32e-01  13
c-REL            REL               9     2.467    5.79e-02  22
Staf             ZN-FINGER, C2H2   10    2.385    3.74e-01  1
Ahr-ARNT         bHLH              15    1.716    2.57e-03  63
deltaEF1         ZN-FINGER, C2H2   23    0.271    5.39e-03  75
Elk-1            ETS               21    0.7875   8.12e-03  37
MZF_1-4          ZN-FINGER, C2H2   27    -0.2421  5.41e-03  73
n-MYC            bHLH-ZIP          30    -0.8738  8.20e-03  51
ARNT             bHLH              31    -0.8738  8.20e-03  51
44
oPOSSUM Server
45
  • http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum

INPUT A LIST OF CO-EXPRESSED GENES
46
SELECT YOUR TFBS PROFILES
47
  • SELECT
  • CONSERVATION
  • PSSM MATCH THRESHOLD
  • PROMOTER REGION
  • STATISTICAL MEASURE

48
de novo Discovery of TF Binding Sites
49
Pattern Discovery
50
de novo Pattern Discovery
  • Exhaustive
      • e.g. YMF (Sinha & Tompa)
      • Generalization: Identify over-represented
        oligomers in comparison of + and - (or
        complete) promoter collections
  • Monte Carlo/Gibbs Sampling
      • e.g. AnnSpec (Workman & Stormo)
      • Generalization: Identify strong patterns in
        promoter collection vs. background model of
        expected sequence characteristics

51
Exhaustive methods
Word-based methods: How likely are X words in a
set of sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC  >EP71002 (+) CeIV msp-56 B range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT  >EP63009 (+) Ce Cuticle Col-12 range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG  >EP63010 (+) Ce Cuticle Col-13 range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT        >EP11013 (+) Ce vitellogenin 2 range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA  >EP11014 (+) Ce vitellogenin 5 range -100 to -75
CATTGACTTTATCGAATAAATCTGTT        >EP11015 (-) Ce vitellogenin 4 range -100 to -75
ATCTATTTACAATGATAAAACTTCAA        >EP11016 (+) Ce vitellogenin 6 range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT  >EP11017 (+) Ce calmodulin cal-2 range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT  >EP63007 (-) Ce cAMP-dep. PKR P1 range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC  >EP63008 (+) Ce cAMP-dep. PKR P2 range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA  >EP17012 (+) Ce hsp 16K-1 A range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT  >EP55011 (-) Ce hsp 16K-1 B range
52
Exhaustive methods
Over-representation: How many words of type
AGGAGTGA are found in our sequences?
How likely is this result?
53
Exhaustive methods
Find all words of length 7 in the yeast genome
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC
GTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCA
CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTG
GAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACG
GTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAA
TCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGA
AGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGG
AAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTC
AAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAG
CTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGC
TCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTG
TTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAAT
TGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGT
ATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCC
ATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGC
CAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGA
TGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTA
TTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGT
TTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCA
GGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGA
CGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTG
GTGACACTGCCA
Make a lookup table:
AAACCTTT 456
TTTTTTTT 57788
GATAGGCA 589
Etc...
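
A lookup table of this kind can be sketched with a sliding window and a counter. The function name `count_words` is an assumption, only the forward strand is counted, and the input here is just the first line of the sequence shown on the slide rather than a whole genome.

```python
from collections import Counter

def count_words(sequence, k):
    """Count every overlapping word of length k on the forward strand."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# First line of the sequence shown on the slide.
seq = "GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC"
table = count_words(seq, 7)
print(len(table), "distinct words of length 7")
print(table.most_common(3))
```

Comparing such counts between a promoter set and a background set is the basis of the over-representation question posed on the previous slide.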
54
Probabilistic Methods for Pattern Discovery
The Gibbs Sampling algorithm
[Figure: aligned sites tgacttcc, tgatctct, agacctca, tgacctct]
Two data structures used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and
   corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site startpoints in the N sequences
   a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}.
One starting point in each sequence is chosen randomly
initially.
55
Probabilistic Methods for Pattern Discovery
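Iteration step: Remove one sequence z from the
set. Update the current pattern according to
[Equation not transcribed; its annotated terms: pseudocount for symbol j; sum of all pseudocounts in column]
[Figure: sequence z removed from the aligned sites tgacttcc, tgatctct, agacctca, tgacctct]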
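
The update formula itself did not survive in the transcript, but its two annotations match the form commonly used for Gibbs motif sampling: q(i,j) = (c(i,j) + b(j)) / (N - 1 + B), where c(i,j) counts symbol j at position i among the remaining sites, b(j) is the pseudocount for symbol j, and B is the sum of the pseudocounts. Treating that as an assumption, a minimal Python sketch of the update step is below. The function name `update_pattern` and the pseudocount value of 0.25 per base are hypothetical.

```python
def update_pattern(sites, left_out, pseudocounts=None):
    """Recompute q[i][base] from every aligned site except index left_out."""
    bases = "acgt"
    pseudocounts = pseudocounts or {b: 0.25 for b in bases}  # assumed b_j
    total_pseudo = sum(pseudocounts.values())                # B
    kept = [s for idx, s in enumerate(sites) if idx != left_out]
    width = len(kept[0])
    q = []
    for i in range(width):
        column = [s[i] for s in kept]
        q.append({b: (column.count(b) + pseudocounts[b])
                     / (len(kept) + total_pseudo)            # N - 1 + B
                  for b in bases})
    return q

# The four aligned sites shown on the slide; the first plays the role of z.
sites = ["tgacttcc", "tgatctct", "agacctca", "tgacctct"]
q = update_pattern(sites, left_out=0)
print(q[0])
```

In a full sampler, the removed sequence z would next be scanned with q and the background frequencies, and a new start position drawn in proportion to the resulting ratios.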
56
Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
57
Four Approaches to Improve Sensitivity
  • Better background models
      • Higher-order properties of DNA
  • Phylogenetic Footprinting
      • Human:Mouse comparison eliminates 75% of
        sequence
  • Regulatory Modules
      • Architectural rules
  • Limit the types of binding profiles allowed
      • TFBS patterns are NOT random

58
Enhancing pattern detection sensitivity
TFBSs are not randomly drawn
  • Information segmentation: information content
    distributions of TFBS are distinctly
    non-random (Wasserman et al. 2000)
  • Palindromicity, dyads (van Helden et al. 2000)
  • Variable gaps (Hu 2003)
59
Pattern discovery methods using biochemical
constraints
60
(No Transcript)
61
Our Hypothesis
  • Point 1: Structurally-related DNA binding domains
    interact with similar target sequences
      • Exceptions exist (e.g. Zn-fingers)
  • Point 2: There are a finite number of binding
    domains used in human TFs
      • Approximately 20-25
  • Idea: We could use the shared binding properties
    for each family to focus pattern detection
    methods
      • Constrain the range of patterns sought

62
Comparison of profiles requires alignment and a
scoring function
  • Scoring function based on sum of squared
    differences
  • Align frequency matrices with modified
    Needleman-Wunsch algorithm
  • Calculate empirical p-values based on simulated
    set of matrices
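
The scoring idea can be illustrated with a deliberately simplified Python sketch. It computes the sum of squared differences between columns and slides the shorter profile along the longer one without gaps. That ungapped scan stands in for the modified Needleman-Wunsch alignment named above, and no empirical p-values are computed. All function names and the toy columns are hypothetical.

```python
def column_ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns."""
    return sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT")

def best_ungapped_match(profile_a, profile_b):
    """Slide the shorter profile along the longer one (no gaps).

    Returns (mean SSD per column, offset) for the best placement;
    lower SSD means more similar.
    """
    short, long_ = sorted((profile_a, profile_b), key=len)
    best = None
    for offset in range(len(long_) - len(short) + 1):
        total = sum(column_ssd(short[i], long_[offset + i])
                    for i in range(len(short)))
        mean = total / len(short)
        if best is None or mean < best[0]:
            best = (mean, offset)
    return best

# Toy frequency columns: one all-A column and one all-C column.
col_A = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
col_C = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}
print(column_ssd(col_A, col_C))
print(best_ungapped_match([col_A], [col_C, col_A, col_C]))
```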

63
Intra-family comparisons more similar than
inter-family
[Figure labels: TF Database (JASPAR), COMPARE]
Jackknife Test: 87% correct
Independent Test Set: 93% correct
64
(No Transcript)
65
FBPs enhance sensitivity of pattern detection
66
(No Transcript)
67
REVIEWING THE TOP POINTS
68
Orientation: Regulatory regions problem space
Sets of binding sites AATCACCAAATCACCAAATCACCA
AATCACCAAATCTCCCAATCTCCGAATCACACAATCATCAAATC
TCACAATCTCTGAGTCCCCAAATCCCGGAATCTGAGAATCCATA
ATTCAGCCAATAACTTGATAACCTAATTAGACGATTACAGGATTA
GCGATTCTTCCTATGAACAGATTAAAAAGACCCCA
Specificity profiles for binding sites
A  -2      0      -2      -0.415  0.585  -2     -2  2.088  -2     -2     -1     0.585
C   1      0.585   0       0      -1     -2     -2  -2     2.088  -2     0.585  0.807
G   0.585  0.322   0.807   1.585   1     -2      2  -2     -2     2.088  -2     0
T   0.319  0.322   1      -2       0     2.088  -1  -2     -2     -2     1.459  -0.415
Clusters of binding sites
Transcription factors Transcription factor
binding sites Regulatory nucleotide sequences
69
Analysis of regulatory regions with TFBS
Detecting binding sites in a single sequence
Scanning a sequence against a PWM
Sp1
Abs_score = 13.4 (sum of column scores)
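
Scanning amounts to sliding the matrix along the sequence and summing one column score per base. The Python sketch below uses the five-column w matrix from the earlier "PFMs to PWMs" slide rather than the Sp1 profile, which is not given in the transcript. The function name `scan` and the test sequence are hypothetical, and only the forward strand of an uppercase A/C/G/T sequence is handled.

```python
def scan(sequence, pwm):
    """Score every window of the sequence as the sum of column scores."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[base][i] for i, base in enumerate(window))
        hits.append((start, window, score))
    return hits

# The w matrix from the "PFMs to PWMs" slide.
pwm = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}
hits = scan("TTAGCCGTT", pwm)
best = max(hits, key=lambda h: h[2])
print(best)
```

In practice a score threshold would be applied to the hits, and the reverse complement would be scanned as well.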
70
Analysis of regulatory regions with TFBS
Phylogenetic Footprints
Scanning a single sequence
Scanning a pair of orthologous sequences for
conserved patterns in conserved sequence regions
A dramatic improvement in the percentage of
biologically significant detections
  • Low specificity of profiles
  • too many hits
  • great majority not biologically significant

71
Pattern Discovery
72
Concluding Thoughts
  • Bioinformatics is often constrained by our
    understanding of biochemistry rather than
    computational or statistical limitations
  • Evolution has a powerful influence on the
    performance of many bioinformatics methods
  • Computational predictions have value, but only if
    you understand the limitations of the methods