1
5.1 Gene Regulation and Promoter Analysis
  • Wyeth Wasserman
  • Centre for Molecular Medicine and Therapeutics
  • Children's and Women's Hospital
  • Department of Medical Genetics
  • University of British Columbia

www.cisreg.ca
2
Overview
  • 5.1.0 Bioinformatics for detection of
    transcription factor binding sites
  • The Specificity Problem
  • 5.1.1 Discrimination of regulatory control
    sequences
  • Based on knowledge of established TFBS
  • 5.1.2 Discovery of regulatory mechanisms
  • Based on de novo pattern discovery
  • 5.1.3 Impending advances

3
Layers of Complexity in Metazoan Transcription
4
Transcription Simplified
URF
Pol-II
TATA
URE
5
5.1.0 Profile Models for Prediction of TF Binding
Sites
6
Representing Binding Sites for a TF
Set of binding sites
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
  • A single site
  • AAGTTAATGA
  • A set of sites represented as a consensus
  • VDRTWRWWSHD (IUPAC degenerate DNA)
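A set of aligned sites is usually summarized first as a position frequency matrix (PFM): one count per base per column. The sketch below is illustrative only; it tallies the 28 sites listed on this slide, and the function name `build_pfm` is a hypothetical helper, not part of any tool named in these slides.

```python
# The 28 binding sites from the slide (10 bp each).
SITES = """
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
""".split()


def build_pfm(sites):
    """Count each base at each position; returns {base: [count per column]}."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for column, base in enumerate(site):
            pfm[base][column] += 1
    return pfm


pfm = build_pfm(SITES)
for base in "ACGT":
    print(base, pfm[base])
```

Every column sums to the number of sites, which is the depth used later when weighting the confidence in the pattern.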

7
PFMs to PWMs
One would like to add the following features to
the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Converting to log-scale probability for easy arithmetic
f matrix
     1     2     3     4     5
A    5     0     1     0     0
C    0     2     2     4     0
G    0     3     1     0     4
T    0     0     1     1     1
w matrix
     1     2     3     4     5
A    1.6  -1.7  -0.2  -1.7  -1.7
C   -1.7   0.5   0.5   1.3  -1.7
G   -1.7   1.0  -0.2  -1.7   1.3
T   -1.7  -1.7  -0.2  -0.2  -0.2
$$ w(b,i) = \log_2 \frac{f(b,i) + s(N)}{p(b)} $$
f(b,i): frequency of base b at position i; s(N): pseudocount
function of the number of sites N; p(b): background frequency of base b
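A minimal sketch of the PFM-to-PWM conversion, using the f matrix shown on this slide. The slide does not spell out the pseudocount function or the normalization, so the choices here are assumptions: a total pseudocount of sqrt(N) split according to the background frequencies, counts normalized by N + sqrt(N), a uniform background of 0.25, and log base 2. Under those assumptions the computed weights round to the w matrix values on the slide.

```python
import math

# Count matrix (f matrix) from the slide: 5 sites, 5 columns.
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pwm(pfm, background=None):
    """Convert counts to log2-odds weights (assumed sqrt(N) pseudocount)."""
    background = background or {base: 0.25 for base in "ACGT"}
    width = len(pfm["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(width):
        n_sites = sum(pfm[base][i] for base in "ACGT")
        pseudo = math.sqrt(n_sites)  # assumption: total pseudocount = sqrt(N)
        for base in "ACGT":
            corrected = (pfm[base][i] + pseudo * background[base]) / (n_sites + pseudo)
            pwm[base].append(math.log2(corrected / background[base]))
    return pwm


pwm = pfm_to_pwm(PFM)
for base in "ACGT":
    print(base, [round(w, 1) for w in pwm[base]])
```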
8
JASPAR OPEN-ACCESS DATABASE OF TF BINDING
PROFILES (Some other databases with TF profiles
include Transfac, TRRD, mPromDB, SCPD (yeast),
dbTBS and EcoTFS (bacteria))
9
Performance of Profiles
  • 95% of predicted sites bound in vitro (Tronche
    1997)
  • MyoD binding sites predicted about once every 600
    bp (Fickett 1995)
  • Futility Theorem
  • Nearly 100% of predicted TFBS have no function in
    vivo
  • Brazma claims it should be called the futility
    conjecture

10
1000bp promoter screened with collection of TF
profiles (beta-globin)
11
5.1.1 Pattern Discrimination
  • Overcoming the specificity problem by
    incorporating biological knowledge into
    computational algorithms

12
Phylogenetic Footprinting
  • 70,000,000 years of evolution reveals most
    regulatory regions.

13
SIDENOTE: Global Progressive Alignments (e.g.
ORCA, AVID, LAGAN)
  • Global alignments: memory scales with the product of the
    sequence lengths
  • Progressive alignment by banding with local
    alignments (e.g. BLAST) and running a global method
    on the banded sub-segments
  • Recursion with decreasingly stringent parameters

14
Phylogenetic Footprinting to Identify Functional
Segments
(Figure: percent identity plotted against the start position of a
200 bp window in the human sequence)
Actin gene compared between human and mouse.
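The identity plot can be approximated with a sliding window over a pairwise alignment. This is a sketch under stated assumptions: the two sequences are already globally aligned (gaps as "-"), windows are indexed by alignment column rather than by human coordinate, and a column counts as identical only when both sequences carry the same non-gap base. The function name and the toy sequences are hypothetical.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Percent identity per sliding window over a gapped pairwise alignment.

    Returns a list of (window_start_column, percent_identity) tuples.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [
        int(a == b and a != "-")
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, 100.0 * identical / window))
    return profile


# Toy alignment with a 5-column window, only to show the output shape.
human = "ACGTACGTAC"
mouse = "ACGTTCGAAC"
profile = window_identity(human, mouse, window=5)
for start, identity in profile:
    print(start, identity)
```

Regions where the profile stays above a chosen cutoff are the candidate conserved segments; the cutoff itself is a per-gene choice.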
15
Phylogenetic Footprinting (cont)
(Figure: FoxC2; percent identity, 0 to 100, plotted against the
start position of a 200 bp window)
16
Recall...
17
1000bp beta-globin promoter screened with
phylogenetic footprinting
18
Choosing the right species... Genes evolve at
different rates: make a gene-specific choice
(Figure panels: chicken vs. human, mouse vs. human, cow vs. human)
19
Performance: Human vs. Mouse Pairwise
(Figure: selectivity and sensitivity)
  • Testing set: 40 experimentally defined sites in
    15 well studied genes (replicated with a 100 site
    set)
  • 85-95% of defined sites detected with
    conservation filter, while only 11-16% of total
    predictions retained

20
ConSite
Now driven by the ORCA Aligner
21
Selected Emerging Issues
  • Multiple sequence comparisons
  • Incorporate phylogenetic distances into a scoring
    metric
  • Visualization (see dcode service and Sockeye)
  • Analysis of many closely related species
  • Phylogenetic shadowing
  • Genome rearrangements
  • Inversion compatible alignment algorithm
  • LAGAN
  • Higher order models of TFBS

22
Regulatory Modules for better specificity
  • TFs do NOT act in isolation

23
Layers of Complexity in Metazoan Transcription
24
Liver regulatory modules
25
PSSMs for Liver TFs
HNF3
HNF1
HNF4
C/EBP
26
Detection of Clusters of TFBS
  • In the best cases, we have enough data to train a
    discriminant function
  • Rare to have sufficient data
  • Alternatively, identify dense clusters of sites
    that are statistically significant
  • Diverse methods have been introduced over the
    past few years (Berman, Markstein, Frith, Noble,
    Wagner)
  • Non-trivial to correct for non-random properties
    of DNA
  • Most difficulty comes from local direct repeats
  • A primary challenge from the biological side is
    the selection of a meaningful grouping of TFs
  • Multiple testing problems severe

27
TFBS Clusters (MSCAN, MCAST, COMET, etc.)
  • MSCAN allows users to submit any set of TF
    profiles
  • Calculates significance for each site based on
    local sequence characteristics
  • G-rich PSSM gets less weight on G-rich region of
    gene
  • Calculates cluster significance using a dynamic
    programming approach
  • Approximately 1 significant liver cluster per
    18,000 bp in human genome sequence
  • Filters to remove significant clusters of sites
    that contain local repeats
  • Identification of non-random characteristics in
    DNA

http://mscan.cgb.ki.se
28
Training predictive models for modules
  • Not every combination of sites is meaningful
  • Reality: some factors are critical, others secondary
  • An alternative is to teach the computer which
    combinations are better
  • Limited by small size of positive training set
  • Explore an older method based on Logistic
    Regression Analysis

29
Recall: Liver regulatory modules
30
Logistic Regression Analysis
Optimize a vector of weights (a1, a2, a3, a4) to maximize the
distance between output values for positive and negative
training data. Output value is
$$ p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}, \qquad \mathrm{logit} = \sum_i a_i x_i $$
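A small sketch of the output value for one candidate window. It assumes the logit is a weighted sum of four per-factor scores (for example one score per liver TF profile); the weights, the scores, and the optional intercept are hypothetical and are not taken from the published liver model. Fitting the weights is not shown.

```python
import math


def module_probability(scores, weights, intercept=0.0):
    """Logistic output p(x) = e^logit / (1 + e^logit) for one window.

    scores  : per-factor inputs x_i (hypothetical values)
    weights : the optimized vector a_i (hypothetical values)
    """
    logit = intercept + sum(a * x for a, x in zip(weights, scores))
    # Algebraically identical to e^logit / (1 + e^logit).
    return 1.0 / (1.0 + math.exp(-logit))


weights = [0.8, 1.2, 0.5, 0.9]          # hypothetical a1..a4
weak_window = [0.1, 0.0, 0.2, 0.0]      # hypothetical scores
strong_window = [2.5, 3.0, 1.5, 2.0]    # hypothetical scores
print(module_probability(weak_window, weights))
print(module_probability(strong_window, weights))
```

A window with all-zero inputs and no intercept lands at 0.5; training pushes positive and negative examples toward opposite ends of the 0 to 1 range.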
31
PERFORMANCE
  • Liver (Genome Research, 2001)
  • At 1 hit per 35 kbp, identifies 60% of modules
  • Limited to genes expressed late in liver
    development

32
UDPGT1 (Gilbert's Syndrome)
(Figure: liver module model score plotted against window position
in sequence, wildtype vs. mutant)
33
Making better predictions
  • Profiles make far too many false predictions to
    have predictive value in isolation
  • Phylogenetic footprinting eliminates about 90% of
    false predictions while retaining 70-70% of real
    sites (human vs mouse)
  • Detection of clusters of binding sites offers
    better predictive performance, especially through
    trained discrimination functions

34
Active Issues
  • Significance of clusters of sites
  • Segmentation of DNA into regions of different
    composition
  • Methods using training to find clusters
  • Where to place weights?
  • Interaction weighting in the absence of large
    data collections
  • Resources
  • Limited number of solid PSSMs
  • Need a reference database for functional
    regulatory regions
  • Validation of predictions for tissues/cells not
    well represented in cell culture

35
EMERGING APPLICATION: Regulatory Analysis of
Variation in ENhancers
  • Genetic variation in TFBS can result in
    biomedically important phenotypes

36
Sequence Variation in TFBS
URF
TSS
AaGT
37
Stage 1: Prediction of Regulatory Regions
38
Stage 1: Predict Regulatory Regions
  • Retrieve orthologous human and mouse gene
    sequences
  • Align sequences with a global aligner (ORCA)
  • Identify regions of conservation
  • Design primers for SNP discovery

(Figure: FoxC2 percent identity plot, y-axis 0 to 100)
39
SIDENOTE: Data/Orthology obtained from GeneLynx
(www.genelynx.org)
40
Stage 2: Analysis of Polymorphisms
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
41
Identify variations that generate allele-specific
binding sites (predicted)
Differences in scores
Pseudo-data for instructional purposes
1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........
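A sketch of the allele comparison: score both alleles of the pseudo-data sequence with a weight matrix and report the difference. The weight matrix here is an assumption made for illustration; it is built from the 28 sites of the "Representing Binding Sites for a TF" slide with a sqrt(N) pseudocount and a uniform background, and only the forward strand is scanned. It is not a description of how RAVEN computes its scores.

```python
import math

SITES = """
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
""".split()

REF_ALLELE = "ACGCATAAGTTAATGAATAACAGAT"  # "t" at position 14
ALT_ALLELE = "ACGCATAAGTTAACGAATAACAGAT"  # "c" at position 14


def build_pwm(sites, background=0.25):
    """Log2-odds weights per column (assumed sqrt(N) pseudocount)."""
    n_sites = len(sites)
    pseudo = math.sqrt(n_sites)
    pwm = []
    for i in range(len(sites[0])):
        column = {}
        for base in "ACGT":
            count = sum(site[i] == base for site in sites)
            corrected = (count + pseudo * background) / (n_sites + pseudo)
            column[base] = math.log2(corrected / background)
        pwm.append(column)
    return pwm


def best_hit(sequence, pwm):
    """Return (score, start) of the best forward-strand window."""
    width = len(pwm)
    return max(
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    )


pwm = build_pwm(SITES)
ref_score, ref_start = best_hit(REF_ALLELE, pwm)
alt_score, alt_start = best_hit(ALT_ALLELE, pwm)
print("reference allele:", round(ref_score, 2), "at", ref_start)
print("variant allele:  ", round(alt_score, 2), "at", alt_start)
print("difference:      ", round(ref_score - alt_score, 2))
```

A large score difference between alleles flags the variant as a candidate allele-specific binding site; whether it matters in vivo still needs experimental validation.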
42
RAVEN screenshots
43
5.1.2 Discovery of Mediating TFBS for Sets of
Co-Regulated Genes
  • Finding characteristics over-represented in a set
    of co-regulated genes

44
Pattern Discovery
45
Linking co-expressed genes from microarrays to
candidate TFs
46
oPOSSUM Project
  • A significant subset of TFs are represented by
    existing binding profiles
  • Within same structural class, often binding
    specificity retained (more on this later)
  • Can we link known TFs to a putative regulon by
    over-representation of predicted binding sites in
    promoters?
  • Identical concept to the detection of
    over-represented GO terms from previous session

47
oPOSSUM Procedure
48
Reference Gene Sets
Fisher p-values: p<1e-05, p<1e-02
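The over-representation test can be sketched as a one-tailed Fisher exact test on a 2x2 table: genes with and without a predicted site, in the co-expressed set versus a disjoint background set. This is an illustrative sketch, not the oPOSSUM implementation; the function name and the gene counts are hypothetical.

```python
from math import comb


def overrep_pvalue(set_hits, set_misses, bg_hits, bg_misses):
    """One-tailed Fisher exact p-value (upper hypergeometric tail).

    Probability of seeing at least `set_hits` genes with a site in the
    co-expressed set, given the totals across set plus background.
    """
    total = set_hits + set_misses + bg_hits + bg_misses
    with_site = set_hits + bg_hits
    set_size = set_hits + set_misses
    numerator = sum(
        comb(with_site, k) * comb(total - with_site, set_size - k)
        for k in range(set_hits, min(set_size, with_site) + 1)
    )
    return numerator / comb(total, set_size)


# Hypothetical counts: 30 of 100 set genes carry the site,
# against 500 of 5000 background genes.
print(overrep_pvalue(30, 70, 500, 4500))
```

One such p-value is computed per TF profile, so the multiple testing burden grows with the size of the profile collection.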
49
MICROARRAY APPLICATION: NF-kB Inhibitor-sensitive
genes (326)
p<1e-30, p<1e-10, p<1e-05, p<1e-02
50
oPOSSUM Server
51
Over-represented Site Combinations (Kreiman 2004)
  • Based on our understanding of CRMs, likely that
    combinations of sites would be more distinguished
    than individual sites (better signal-to-noise?)
  • Kreiman has introduced a system to assess
    clusters of neighbouring conserved sites based on
    counting
  • Hypergeometric distribution, simply compare the
    frequency of the cluster occurrence vs.
    expectation

52
What if the TFBS is novel?
53
de novo Pattern Discovery Methods
  • String-based
  • e.g. Moby Dick (Bussemaker, Li & Siggia)
  • Identify over-represented oligomers in comparison
    of + and - (or complete) promoter collections
  • Profile-based
  • Monte Carlo/Gibbs Sampling
  • e.g. AnnSpec (Workman & Stormo)
  • Identify strong patterns in promoter
    collection vs. background model of expected
    sequence characteristics

54
String-based Exhaustive Methods
Word-based methods: How likely are X words in a
set of sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC  >EP71002 (+) CeIV msp-56 B range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT  >EP63009 (+) Ce Cuticle Col-12 range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG  >EP63010 (+) Ce Cuticle Col-13 range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT        >EP11013 (+) Ce vitellogenin 2 range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA  >EP11014 (+) Ce vitellogenin 5 range -100 to -75
CATTGACTTTATCGAATAAATCTGTT        >EP11015 (-) Ce vitellogenin 4 range -100 to -75
ATCTATTTACAATGATAAAACTTCAA        >EP11016 (+) Ce vitellogenin 6 range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT  >EP11017 (+) Ce calmodulin cal-2 range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT  >EP63007 (-) Ce cAMP-dep. PKR P1 range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC  >EP63008 (+) Ce cAMP-dep. PKR P2 range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA  >EP17012 (+) Ce hsp 16K-1 A range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT  >EP55011 (-) Ce hsp 16K-1 B range
55
Exhaustive methods (2)
Find all words of length 7 in the yeast genome
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC
GTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCA
CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTG
GAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACG
GTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAA
TCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGA
AGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGG
AAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTC
AAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAG
CTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGC
TCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTG
TTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAAT
TGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGT
ATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCC
ATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGC
CAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGA
TGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTA
TTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGT
TTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCA
GGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGA
CGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTG
GTGACACTGCCA
Make a lookup table
TTTTTTTT/aaaaaaa    57788
GATAGGCA/tgcctatc   589
AAACCTTT/aaaggttt   456
Etc...
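A minimal sketch of the lookup table: count every word of length k and merge each word with its reverse complement, as the paired entries on the slide suggest. Storing each pair under the alphabetically smaller of the two words is an assumption made here for convenience, and the short input string is a toy example rather than genome sequence.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of an upper-case DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def count_words(sequence, k):
    """Count all k-mers, pooling each word with its reverse complement."""
    counts = Counter()
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        if set(word) <= set("ACGT"):  # skip windows with N or other symbols
            counts[min(word, reverse_complement(word))] += 1
    return counts


print(reverse_complement("GATAGGCA"))
table = count_words("GATAGGCATGCCTATC", 8)
for word, count in table.most_common(3):
    print(word + "/" + reverse_complement(word).lower(), count)
```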
56
Exhaustive methods (3)
Over-representation: How many words of type
AGGAGTGA are found in our sequences?
How likely is this result?
57
Exhaustive methods (4)
Modeling Properties of DNA
Simple: How likely are single nucleotides?
(extended Bernoulli)
Complex: Neglect certain words; locations of TFBS;
higher-order descriptions of DNA
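The simple model can be sketched as follows: the probability of a word is the product of its single-nucleotide frequencies, and the chance of seeing at least the observed number of occurrences follows a binomial tail. The assumptions are loud ones: word positions are treated as independent (overlaps ignored), only one strand is counted, and the base frequencies and counts below are hypothetical.

```python
from math import comb


def word_probability(word, base_freqs):
    """Probability of a word under an independent-nucleotide model."""
    probability = 1.0
    for base in word:
        probability *= base_freqs[base]
    return probability


def prob_at_least(observed, positions, p_word):
    """P(X >= observed) for X ~ Binomial(positions, p_word)."""
    below = sum(
        comb(positions, k) * p_word ** k * (1.0 - p_word) ** (positions - k)
        for k in range(observed)
    )
    return 1.0 - below


base_freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical
p_word = word_probability("AGGAGTGA", base_freqs)
# Hypothetical: 5 occurrences among 20,000 candidate positions.
print(p_word, prob_at_least(5, 20000, p_word))
```

Higher-order background models replace `word_probability` with something that accounts for dinucleotide or longer context; the tail calculation stays the same.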
58
Exhaustive methods: Key items
  • Algorithms with high complexity - large sequences
    and/or many possible word lengths are not feasible
  • Often string-based
  • TFBS are not words (fuzzy binding)
  • Sensitivity susceptible to noisy input data (e.g.
    microarrays)

59
Profile-based Methods (usually probabilistic)
Find a local alignment of width x of sites that
maximizes information content in reasonable
time. Usually by Gibbs sampling or EM methods.
Motivations:
  • TFBS are not words
  • Efficiency
  • Can be intentionally influenced by biological data
60
Profile Methods (2)
The Gibbs Sampling algorithm
tgacttcc
tgatctct
agacctca
tgacctct
Two data structures used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and
corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site startpoints in the N sequences
a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}.
One starting point in each sequence is chosen randomly
initially.
61
Profile Methods (3)
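Iteration step: Remove one sequence z from the
set. Update the current pattern according to the
remaining sequences (update formula not transcribed; its
labelled terms are the pseudocount for symbol j and the sum
of all pseudocounts in the column).
(Figure: the aligned sites tgacttcc, tgatctct, agacctca,
tgacctct, with sequence z set aside)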
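A compact sketch of a Gibbs site sampler along the lines of these two slides: one start per sequence, leave one sequence z out, rebuild the pattern from the rest, then resample the start in z. The update used here is the commonly cited form q = (count + pseudocount) / (N - 1 + sum of pseudocounts); treating that as the slide's formula is an assumption, as are the uniform pseudocount, the single position-independent background, the fixed iteration count, and the toy sequences.

```python
import random

BASES = "ACGT"


def gibbs_site_sampler(seqs, width, iterations=200, pseudocount=0.5, seed=0):
    """Return one sampled site start per sequence (forward strand only)."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    total = sum(len(s) for s in seqs)
    background = {
        b: (sum(s.count(b) for s in seqs) + pseudocount) / (total + 4 * pseudocount)
        for b in BASES
    }
    for _ in range(iterations):
        for z in range(len(seqs)):
            # Pattern frequencies from every sequence except z.
            others = [
                seqs[k][starts[k]:starts[k] + width]
                for k in range(len(seqs)) if k != z
            ]
            pattern = []
            for i in range(width):
                column = [site[i] for site in others]
                pattern.append({
                    b: (column.count(b) + pseudocount) / (len(others) + 4 * pseudocount)
                    for b in BASES
                })
            # Weight each candidate start in z by pattern/background odds.
            weights = []
            for start in range(len(seqs[z]) - width + 1):
                odds = 1.0
                for i in range(width):
                    base = seqs[z][start + i]
                    odds *= pattern[i][base] / background[base]
                weights.append(odds)
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy sequences (hypothetical flanks around the four sites on the slide).
SEQS = ["ACGTTGACTTCCGATA", "GGTGATCTCTAACC", "CCAGACCTCATTGA", "TTTGACCTCTGGAC"]
starts = gibbs_site_sampler(SEQS, width=8)
for seq, start in zip(SEQS, starts):
    print(start, seq[start:start + 8])
```

Because the sampler is stochastic, different seeds can settle on different alignments; in practice several runs are compared.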
62
Pattern Discovery Across Orthologous Promoters
from Gram-Positive Bacteria
Real sets
random
63
EXAMPLE: Yeast Regulatory Sequence Analysis (YRSA)
system
64
Tests of YRSA System
65
SIDENOTE: Comparison of profiles requires
alignment and a scoring function
  • Scoring function based on sum of squared
    differences
  • Align frequency matrices with modified
    Needleman-Wunsch algorithm
  • Calculate empirical p-values based on simulated
    set of matrices
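A sketch of the first two points above: a column score derived from the sum of squared differences, and a global (Needleman-Wunsch style) alignment of two frequency matrices. The slide does not give the modifications, so several details are assumptions: the column score is taken as 2 minus the squared difference (so identical columns score 2), gaps carry a fixed linear penalty, and the reverse strand is not considered. The empirical p-value step is not shown.

```python
def column_score(col_a, col_b):
    """2 minus the sum of squared differences of two frequency columns."""
    return 2.0 - sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT")


def align_profiles(matrix_a, matrix_b, gap=-1.0):
    """Global alignment score of two matrices given as lists of columns."""
    rows, cols = len(matrix_a), len(matrix_b)
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        score[i][0] = i * gap
    for j in range(1, cols + 1):
        score[0][j] = j * gap
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(matrix_a[i - 1], matrix_b[j - 1]),
                score[i - 1][j] + gap,   # gap in matrix_b
                score[i][j - 1] + gap,   # gap in matrix_a
            )
    return score[rows][cols]


# Hypothetical two-column frequency matrices.
profile_1 = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
             {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}]
profile_2 = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
             {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
print(align_profiles(profile_1, profile_1))
print(align_profiles(profile_1, profile_2))
```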

66
How is the Performance? Hit and Miss
67
Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
68
Overcoming the sensitivity challenge
  • Metazoan genomes are far from ideal

69
Biochemical complexity enables greater complexity
in regulation
70
Four Approaches to Improve Sensitivity
  • Better background models
  • Higher-order properties of DNA
  • Phylogenetic Footprinting
  • Human:Mouse comparison eliminates 75% of
    sequence
  • Regulatory Modules
  • Architectural rules
  • Limit the types of binding profiles allowed
  • TFBS patterns are NOT random

71
Phylogenetic Footprinting to Identify Conserved
Regions
Bayes Block Aligner (Lawrence Group)
ORCA
72
Skeletal Muscle Genes
  • One of the most extensively studied tissues for
    transcriptional regulation
  • 45 genes partially analyzed
  • 26 genes with orthologous genomic sequence from
    human and rodent
  • Five primary classes of transcription factors
  • Principal: Myf (MyoD), Mef2, SRF
  • Secondary: Sp1 (G/C rich patches), Tef (subset of
    skeletal muscle types)

73
de novo Discovery of Skeletal Muscle
Transcription Factor Binding Sites
Mef2-Like
SRF-Like
Myf-Like
74
Pattern discovery methods using biochemical
constraints
75
RECALL Gibbs Algorithm
z
tgacttcc
tgatctct
agacctca
tgacttcc
tgacctct
tgatctct
agacctca
tgacctct
77
Intra-family PSSM similarity
TF Database (JASPAR)
COMPARE
Jackknife Test: 87% correct
Independent Test Set: 93% correct
79
FBPs enhance sensitivity of pattern detection
81
APPLICATION: Cancer Protection Response
  • Detoxification-related enzymes are induced by
    compounds present in broccoli
  • Arrays, SSH and hard work have defined a set of
    responsive genes
  • A known element mediates the response
    (Antioxidant Responsive Element)
  • Controversy over the type of mediating leucine
    zipper TF
  • NF-E2/Maf or Jun/Fos

82
Application (2)
Problem: Given a set of co-regulated genes,
determine the common TFBS. Classify the
mediating TF. We expect a leucine zipper-type
TF.
85
EMERGING METHOD: de novo Analysis of Regulatory
Modules
86
Focus on regulatory modules for pattern detection
Cluster Genes by Expression
87
Analyze co-regulated genes to define circuit
characteristics
(Figure panels: General Circuit Properties; Specific Gene Features;
Binding Profiles (m_i); Neighbor Interactions (a_ij between
profiles m_i and m_j); Width Distributions (sum of separations))
88
Discovery performance
  • Approximately 50% of annotated TFBS are detected
    in the training set sequences of 25 genes
  • Only 40% of predicted TFBS are annotated
  • We suspect that most of the un-annotated sites
    will turn out to be functional. This needs to be
    determined.

89
Review of Primary Points
  • Second Chance

90
Regulatory regions problem space
Sets of binding sites
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG
AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG
AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC
GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
Specificity profiles for binding sites
A  -2      0      -2      -0.415   0.585  -2      -2       2.088  -2      -2      -1       0.585
C   1      0.585   0       0      -1      -2      -2      -2       2.088  -2       0.585   0.807
G   0.585  0.322   0.807   1.585   1      -2       2      -2      -2       2.088  -2       0
T   0.319  0.322   1      -2       0       2.088  -1      -2      -2      -2       1.459  -0.415
Clusters of binding sites
Transcription factors Transcription factor
binding sites Regulatory nucleotide sequences
91
Detecting binding sites in a single sequence
Scanning a sequence against a PWM
Sp1
Abs_score 13.4 (sum of column scores)
Is 93 better than 82?
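A sketch of scanning a sequence against a PWM and reporting both an absolute score (sum of column scores) and a relative score. The relative score used here, the absolute score rescaled between the matrix minimum and maximum, is one common convention and is assumed rather than taken from the slide. The weights are the rounded w matrix from the "PFMs to PWMs" slide; the sequence, the 80% threshold, and the forward-strand-only scan are hypothetical choices.

```python
# Rounded weights from the w matrix on the "PFMs to PWMs" slide.
PWM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}


def scan(sequence, pwm, threshold=80.0):
    """Return (start, site, abs_score, rel_score) for hits above threshold."""
    width = len(pwm["A"])
    max_score = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    min_score = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        abs_score = sum(pwm[site[i]][i] for i in range(width))
        rel_score = 100.0 * (abs_score - min_score) / (max_score - min_score)
        if rel_score >= threshold:
            hits.append((start, site, round(abs_score, 2), round(rel_score, 1)))
    return hits


hits = scan("TTAGCCGTT", PWM)
for hit in hits:
    print(hit)
```

Relative scores are only comparable within one matrix: a given percentage on a short, degenerate profile carries less information than the same percentage on a long, specific one.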
92
Phylogenetic Footprints
Scanning a single sequence
Scanning a pair of orthologous sequences for
conserved patterns in conserved sequence regions
A dramatic improvement in the percentage of
biologically significant detections
  • Low specificity of profiles
  • too many hits
  • great majority are not biologically significant

93
Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
94
Acknowledgements
  • Wasserman Group
  • Wynand Alkema
  • Dave Arenillas
  • Jochen Brumm
  • Alice Chou
  • Shannan Ho Sui
  • Danielle Kemmer
  • Jonathan Lim
  • Raf Podowski
  • Dora Pak
  • Albin Sandelin
  • Chris Walsh
  • Collaborating Trainees
  • Malin Andersson (KTH)
  • Öjvind Johansson (UCSD)
  • Stuart Lithwick (U.Toronto)

  • Collaborators
  • Boris Lenhard (K.I.)
  • Chip Lawrence (Wadsworth)
  • William Thompson (Wadsworth)
  • Jens Lagergren (KTH)
  • Christer Höög (K.I.)
  • Brenda Gallie (OCI)
  • Jacob Odeberg (KTH)
  • Niclas Jareborg (AZ)
  • William Hayes (AZ)
  • James Mortimer (MF)
  • Group Alumni
  • Elena Herzog
  • Annette Höglund
  • William Krivan
  • Luis Mendoza
  • Support
  • CIHR, CGDN, CFI, Merck-Frosst, BC Children's
    Hospital Foundation, Pharmacia, EC Marie Curie,
    KI-Funder
95
EXTRA SLIDES: What will a computational biologist
do with a scoring function?
  • Build a similarity tree!

96
The matrix tree
97
Compare with consensus for both classes - CANNTG