Title: Evolutionary and genomic approaches to find gene regulatory sequences
1. Evolutionary and genomic approaches to find gene regulatory sequences
- Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Kateryna Makova, Stephan Schuster, Ross Hardison
- University of California at Santa Cruz: David Haussler, Jim Kent
- Children's Hospital of Philadelphia: Mitch Weiss
- NimbleGen: Roland Green
University of Nebraska, Lincoln, February 14, 2007
2. Major goals of comparative genomics
- Identify all DNA sequences in a genome that are functional
  - Selection to preserve function
  - Adaptive selection
- Determine the biological role of each functional sequence
- Elucidate the evolutionary history of each type of sequence
- Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research
3. Known types of gene regulatory regions
G.A. Maston, S.K. Evans, M.R. Green (2006) Ann. Rev. Genomics Human Genetics 7: 29-59.
4. Regulatory regions tend to be clusters of transcription factor binding sites
Figure labels: Sequence-specific; SV40 promoters and enhancer
5. Properties of known regulatory regions
- Binding sites for transcription factors, many with sequence specificity
- Clusters of binding sites
- Conventional promoters encompass major start sites for transcription
- Conserved over evolutionary time???
6. Structures involved in transcription are probably more complex
Middle image: green, active transcription (Br-UTP label); red, all nucleic acids; HeLa cell.
Sides: EM spreads of transcripts.
Peter R. Cook, Oxford University, http://users.path.ox.ac.uk/pcook/images/Images.html
7. Domain opening is associated with movement to non-heterochromatic regions
Schubeler, Francastel, Cimbora, Reik, Martin, Groudine (2000) Genes Dev. 14: 940-950
8. Other possible activities for sequences involved in gene regulation
- Opening or closing a chromosomal domain
- Move a gene to or away from a transcription factory
- Control how long a gene is in a transcription factory
  - Long association
    - High level expression
    - Really long gene
  - Short association
    - Lower level expression
    - Rapid regulation
- Are these conserved over evolutionary time?
9. 3 modes of evolution
Sequence matches at longer phylogenetic distances could reflect purifying selection.
Sequence differences at closer phylogenetic distances could reflect adaptive evolution.
10. Conservation vs. Constraint
- Conserved sequences are those that align between two species thought to be descended from a common ancestor
- Constrained sequences show evidence in their alignments of negative (purifying) selection
  - E.g. change at a rate significantly slower than neutral DNA
11. Ideal cases for interpretation
12. Messages about evolutionary approaches to predicting regulatory regions
- Regulatory regions are conserved, but not all to the same phylogenetic distance.
- Incorporation of pattern and composition information along with conservation can lead to effective discrimination of functional classes (regulatory potential).
- Regulatory potential in combination with conservation of a GATA-1 binding motif is an effective predictor of enhancer activity.
- In vivo occupancy by GATA-1 suggests other activities in addition to enhancers.
- Comparison of polymorphism and divergence from closely related species can reveal regulatory regions that are under recent selection.
13. Finding all gene regulatory regions is a challenge for comparative genomics
- Known regulatory regions for the HBB complex
  - 23 total
  - 19 conserved (align) between human and mouse
  - Many others show no significant difference in a measure of constraint (phastCons) from the bulk or neutral DNA
14. Two extremes of constraint in TRRs
15. ENCODE projects
- ENCODE (ENCyclopedia Of DNA Elements): consortium aiming to find function for all human DNA sequences
- Phase I focused on 1% of human DNA
  - 30 Mb, 44 regions
  - About 10 regions had known genes of interest (CFTR, HOX)
  - Others were chosen to get a sampling of regions varying in gene density and alignability with mouse
- Major areas
  - Genes and transcripts
  - Transcriptional regulation
  - Chromatin structure
  - Multiple sequence alignment
  - Variation in human populations
16. Biochemical assays for protein-binding sites in DNA
Purified protein, naked DNA
Chromatin immunoprecipitation: DNA sites occupied by a protein inside cells.
17. ChIP-on-chip to examine many sites
18. Putative transcriptional regulatory regions (pTRRs)
- Antibodies vs 10 sequence-specific factors
  - Sp1, Sp3, E2F1, E2F4, cMyc, STAT1, cJun, CEBPe, PU1, RA Receptor A
- High resolution ChIP-chip platforms: Affymetrix and NimbleGen
- Data from several different labs in ENCODE consortium
- High likelihood hits for ChIP-chip
  - 5% false discovery rate
- Supported by chromatin modification data
  - Modified histones in chromatin: H4Ac, H3Ac, H3K4me, H3K4me2, H3K4me3, etc.
  - DNase hypersensitive sites (DHSs) or nucleosome depleted sites
- Result: set of 1369 pTRRs
19. A small fraction of cis-regulatory modules are conserved from human to chicken
- About 4% of pTRRs, 4% of DNase HSs, 4-7% of promoters active in multiple cell lines
- Tend to regulate genes whose products control transcription and development
Figure axis: millions of years (91, 173, 310, 450).
David King
20. Most pTRRs are conserved in eutherian mammals
Percentage of class that align no further than:
              pTRRs   DNase HSs   Promoters
Primates        3        11         1-13
Eutherians     71        70         63
Marsupials     21        14         16-28
Tetrapods       4         4         4-7
Vertebrates     1         1         2-4
Figure axis: millions of years (91, 173, 310, 450).
Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.
21. Measures of conservation and constraint capture only a subset of pTRRs
Fraction overlapping an MCS
phastCons (background rate corrected)
Composite alignability (background rate corrected)
Aligns, but no inference about purifying selection
Allows a range of constraint
Stringent constraint
22. Different measures perform better on specific functional regions
Plot axes: Sensitivity vs. 1-Specificity
23. Examples of clade-specific pTRRs
25. Regulatory potential (RP) to distinguish functional classes
26. Good performance of ESPERR for gene regulatory regions (RP)
Francesca Chiaromonte
James Taylor
28. Conservation of predicted binding sites for transcription factors
Binding site for GATA-1
29. Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. Can rescue by expressing an estrogen-responsive form of GATA-1.
Rylski et al., Mol Cell Biol. 2003
30. Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes
B: Yong Cheng, Ross, Yuepin Zhou, David King
F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang
31. preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids
32. preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome
33. Examples of validated preCRMs
34. Correlation of Enhancer Activity with RP Score
35. Validation status for 99 tested fragments
36. preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated
37. CACC box helps distinguish validated from nonvalidated preCRMs
Ying Zhang
39. preCRMs with conserved consensus GATA-1 binding sites are usually occupied by that protein: ChIP assay
40. Design of ChIP-chip for occupancy by GATA-1
- Non-overlapping tiling array with 50bp probe and 100bp resolution (NimbleGen)
- Cover range
  - Mouse chr7:57225996-123812258 (70Mbp)
- Antibody against the ER portion of GATA-1-ER protein in rescued G1E-ER4 cells
Yong Cheng, with Mitch Weiss and Lou Dore (CHoP), Roland Green (NimbleGen)
41. Signals in known occupied sites in Hbb LCR
Figure labels: HS1, HS2, HS3
1) Cluster of high signals
2) Hill shape of the signals
42. Peak Finding Programs
- TAMALPAIS
  - Mark Bieda from Peggy Farnham's lab
  - Focus more on the cluster of the signals
  - 4 thresholds based on number of consecutive probes with signals in the 98th or 95th percentiles
- MPEAK
  - Bing Ren's lab
  - Focus more on the hill shape of the signal
  - 4 thresholds, for a series of probes with at least one that is 3, 2.5, 2 or 1 standard deviations above the mean
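The consecutive-probe criterion described for TAMALPAIS can be illustrated with a minimal sketch. This is not the TAMALPAIS code: the function names, the nearest-rank percentile, and the `pct` and `min_run` parameters are assumptions chosen only to show the idea of calling a peak when enough neighboring probes all exceed a percentile cutoff.

```python
import math
from typing import List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def call_peaks(signals: List[float], pct: float = 95.0, min_run: int = 4) -> List[Tuple[int, int]]:
    """Return (first_probe, last_probe) index pairs for runs of at least
    min_run consecutive probes whose signal is strictly above the pct-th percentile.

    pct and min_run are hypothetical settings, not the published thresholds.
    """
    if not signals:
        return []
    cutoff = percentile(signals, pct)
    peaks = []
    run_start = None
    for i, s in enumerate(signals):
        if s > cutoff:
            if run_start is None:
                run_start = i  # a new run of high probes begins here
        else:
            if run_start is not None and i - run_start >= min_run:
                peaks.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the last probe.
    if run_start is not None and len(signals) - run_start >= min_run:
        peaks.append((run_start, len(signals) - 1))
    return peaks
```

Requiring a run of probes rather than a single high probe is what makes this style of caller favor clusters of signal; an MPEAK-style caller would instead look for a local maximum a set number of standard deviations above the mean.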
43. ChIP-chip hits for GATA-1 occupancy
Technical replicates of ChIP-chip with antibody against GATA1-ER
Figure labels: Mpeak, TAMALPAIS; 275 hits in both; 276 hits in both; 216; 60; 59
321 total ChIP-chip hits
44. ChIP-chip hits validate at a high rate
Validation determined by quantitative PCR. 19 of the 321 hits were tested. 13 (70%) were validated.
Figure label: ChIP DNA
Validation rate is similar at different thresholds.
9 regions were hits in only one of the two technical replicates. None were validated.
45. Association of WGATAR and conservation with ChIP-chip Hits
- 249 out of the 321 (78%) have WGATAR motifs, binding site for GATA-1
- Of the GATA-1 binding motifs in those 249 hits, 112 (45%) are conserved between mouse and at least one non-rodent species.
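As an illustration of how a hit can be checked for the WGATAR motif, here is a minimal sketch that scans both strands of a sequence. In IUPAC codes W is A or T and R is A or G, so the reverse complement of WGATAR is YTATCW. The function name and return format are assumptions for this example, and the conservation check against other species is not shown.

```python
import re
from typing import List, Tuple

# WGATAR on the forward strand: W = A or T, R = A or G.
# Lookahead patterns are used so that overlapping matches are all reported.
FORWARD = re.compile(r"(?=([AT]GATA[AG]))")
# Reverse complement YTATCW: Y = C or T, W = A or T.
REVERSE = re.compile(r"(?=([CT]TATC[AT]))")


def find_wgatar(seq: str) -> List[Tuple[int, str, str]]:
    """Return (0-based start, strand, matched 6-mer) for every WGATAR occurrence."""
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-", m.group(1)) for m in REVERSE.finditer(seq)]
    return sorted(hits)


# Hypothetical sequence with one motif on each strand.
example_hits = find_wgatar("CCAGATAAGGCTTATCTCC")
```

A fragment would count as having a WGATAR motif if this list is non-empty; the 6-mer is short, so matches are expected fairly often by chance, which is one reason the slide also asks whether the motif is conserved.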
46. Expected and unexpected ChIP-chip hits
47. Distribution of ChIP-chip hits on 70Mb of mouse chr7
Yong Cheng, Yuepin Zhou and Christine Dorman
48. Almost half the GATA-1 ChIP-chip hits increase expression of a transgene, K562 cells
Figure labels: 15; 6; 6; No GATA-1; GATA-1 occupied sites by ChIP-chip
24 validated out of 56 fragments with ChIP-chip hits tested (43%)
49. Conserved and nonconserved ChIP-chip hits can be active as enhancers
51. Polymorphism as a transient phase of evolution
Slide from Dr. Hiroshi Akashi
52. Test of neutrality using polymorphism and divergence data
53. Test for recent selection in human noncoding DNA
- McDonald-Kreitman test
  - Use ancestral repeats as neutral model (MKAR test)
- Count polymorphisms in human using dbSNP126
- Count divergence of human from
  - Chimpanzee (great ape, diverged from human lineage 6 Myr ago)
  - Rhesus macaque (Old World monkey, diverged from human lineage 23 Myr ago)
- Tiled windows, most analysis on 10kb windows
- Compute p-value for neutrality by chi-square test
- Ratio of polymorphism to divergence ratios gives indication of direction of inferred selection
Heather Lawson, Anthropology, PSU
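The chi-square step and the ratio of ratios can be sketched for a single window as follows. This is an illustrative calculation, not the MKAR implementation used in the talk: the function name, the use of an uncorrected Pearson chi-square on a 2x2 table, and the example counts are all assumptions.

```python
import math
from typing import Tuple


def mkar_test(poly_win: int, div_win: int, poly_ar: int, div_ar: int) -> Tuple[float, float, float]:
    """Compare polymorphism and divergence counts in a window against ancestral repeats.

    Returns (chi-square statistic, p-value with 1 degree of freedom,
    ratio of the window's polymorphism/divergence ratio to that of ancestral repeats).
    """
    if min(poly_win, div_win, poly_ar, div_ar) <= 0:
        raise ValueError("all four counts must be positive in this simple sketch")
    table = [[poly_win, div_win], [poly_ar, div_ar]]
    row_totals = [sum(row) for row in table]
    col_totals = [poly_win + poly_ar, div_win + div_ar]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    ratio = (poly_win / div_win) / (poly_ar / div_ar)
    return chi2, p_value, ratio


# Hypothetical counts: 10 SNPs and 40 fixed differences in a window,
# 100 SNPs and 100 fixed differences in nearby ancestral repeats.
chi2, p_value, ratio = mkar_test(10, 40, 100, 100)
```

Under the usual reading of McDonald-Kreitman-style tests, a ratio below 1 (an excess of divergence relative to polymorphism) points toward positive selection and a ratio above 1 toward purifying selection, but that interpretation is stated here as a general convention rather than taken from the slide.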
54. pTRR apparently under positive selection
55. A promoter distal to the beta-like globin genes has a signal for recent purifying selection
56. Selection on a primate-specific promoter
57. The distal promoter is close to the locus control region for beta-globin genes
59. Many thanks
PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko
B: Yong Cheng, Ross, Yuepin Zhou, David King
F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang
RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski
Alignments, chains, nets, browsers, ideas: Webb Miller, Jim Kent, David Haussler
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
60. Computing Regulatory Potential (RP)
Alignment
seq1                G T A C C T A C T A C G C A
seq2                G T G T C G - - A G C C C A
seq3                A T G T C A - - A A T G T A
Collapsed alphabet  1 2 1 3 4 5 7 7 6 8 3 6 3 9
- A 3-way alignment has 124 types of columns. Collapse these to a smaller alphabet with characters s (for example, 1-9).
- Train two order t Markov models for the probability that t alignment columns are followed by a particular column in training sets
  - positive (alignments in known regulatory regions)
  - negative (alignments in ancestral repeats, a model for neutral DNA)
- E.g. Frequency that 3 4 is followed by 5
  - 0.001 in regulatory regions
  - 0.0001 in ancestral repeats
- RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of alignment characters in known regulatory regions vs. ancestral repeats.
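The scoring rule on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ESPERR or RP software: the function names, the add-one pseudocount, and the tiny training strings are hypothetical, and the collapsed alphabet is taken as given rather than learned.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

ProbFn = Callable[[Tuple[int, ...], int], float]


def train_markov(seqs: List[Sequence[int]], order: int, alphabet: Sequence[int],
                 pseudocount: float = 1.0) -> ProbFn:
    """Estimate P(symbol | previous `order` symbols) from strings of collapsed alignment columns."""
    counts: Dict[Tuple[int, ...], Dict[int, float]] = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1.0

    def prob(context: Tuple[int, ...], symbol: int) -> float:
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(symbol, 0.0) + pseudocount) / total

    return prob


def rp_score(seq: Sequence[int], order: int, pos: ProbFn, neg: ProbFn) -> float:
    """Sum of log likelihood ratios (regulatory model vs. ancestral-repeat model)."""
    score = 0.0
    for i in range(order, len(seq)):
        context = tuple(seq[i - order:i])
        score += math.log(pos(context, seq[i]) / neg(context, seq[i]))
    return score


# The slide's numbers: "3 4" followed by "5" has frequency 0.001 in regulatory
# regions and 0.0001 in ancestral repeats, contributing log(10) to the score.
slide_term = rp_score([3, 4, 5], 2, lambda c, s: 0.001, lambda c, s: 0.0001)

# Toy training sets (hypothetical): the slide's collapsed string as "positive".
alphabet = list(range(1, 10))
positive = train_markov([[1, 2, 1, 3, 4, 5, 7, 7, 6, 8, 3, 6, 3, 9]], 2, alphabet)
negative = train_markov([[1, 1, 1, 1, 1, 1, 1, 1]], 2, alphabet)
toy_score = rp_score([3, 4, 5], 2, positive, negative)
```

A positive score means the string of columns is more typical of the regulatory training set than of ancestral repeats. Natural logarithms are used here; the slide does not specify the base.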
61. Stage 1: Reduced representations
ESPERR: Evolutionary Sequence and Pattern Extraction using Reduced Representations
62. Stage 2: Improve encoding
63. Train models for classification
64. Categories of Tested DNA Segments
65. Example that suggests turnover
GATA-1 BSs
66. Additional methods find CACC box as distinctive for validation
All validated preCRMs
All nonvalidated preCRMs
CLOVER (Zlab)
Hexamer Counting
Background
ELPH (UMaryland)
EKLF PWM (Dr. Perkins)
Mouse chr 19 (42.8% CG) - NCBI Build 30
67. Using Galaxy to find predicted CRMs