Title: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes
1Evolutionarily conserved elements in vertebrate,
insect, worm, and yeast genomes
- Junguk Hur
- School of Informatics
Adam Siepel et al., Genome Research, 2005, 15: 1034-1050
2Keywords
- Conserved elements
- phastCons
- Two-state Phylogenetic Hidden Markov Model (Phylo-HMM)
- Functional categories of bases (elements): CDS, UTR, other mRNA, intron, and unannotated
3Background-I
- Many genome sequencing projects have been completed.
- Q) How much of a vertebrate genome is directly functional? And exactly which regions?
- More is known in model organisms, but much still remains to be learned.
- Methods for identifying sequences likely to be functional are of critical importance.
- What's best? Look for sequences conserved across species.
- Conservation is due to negative (purifying) selection.
4Background-II
- About 5% of the mammalian genome is under purifying selection (Smith et al. 2004 and rodent genome projects).
- 1.5% -> protein-coding genes
- 3.5% -> functional but not protein-coding ("Dark Matter")
- Conserved sequences are mostly identified from pairwise alignments with simple, percent-identity-based methods
- VISTA (Mayor et al. 2000), PipMaker (Schwartz et al. 2000), zPicture (Ovcharenko et al. 2004), etc.
- Some methods use multiple alignments to take phylogeny into account
- Stojanovic et al. 1999; Boffelli et al. 2003; Margulies et al. 2003; Chapman et al. 2004; Cooper et al. 2004; Ovcharenko et al. 2005
- Limits: sliding windows, little use of the branch lengths of the phylogeny (allowing for multiple substitutions per site)
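As an illustration of the percent-identity, sliding-window approach listed above, here is a minimal sketch. It is not the VISTA, PipMaker, or zPicture implementation; the function name `window_identity`, the window size, and the toy sequences are assumptions made for the example.

```python
def window_identity(seq_a, seq_b, window=5):
    """Percent identity of two aligned sequences in each sliding window."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where the aligned bases match and are not gaps, otherwise 0.
    matches = [int(a == b and a != "-") for a, b in zip(seq_a, seq_b)]
    scores = []
    for start in range(len(matches) - window + 1):
        scores.append(100.0 * sum(matches[start:start + window]) / window)
    return scores

# Toy pairwise alignment (hypothetical sequences).
human = "ACGTACGTTT"
mouse = "ACGTTCGAAT"
scores = window_identity(human, mouse, window=5)
print(scores)
```

Every window is scored the same way regardless of how far apart the two species are, which is the limitation the slide points to: branch lengths and multiple substitutions per site play no role.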
5Preview of result by phastCons
[Figure: fraction of each genome covered by predicted conserved elements]
- Vertebrate: 3-8%
- Insect: 37-53%
- Worm: 18-37%
- Yeast: 47-68%
6phastCons
- Identifies conserved elements in multiply aligned sequences.
- Based on a phylogenetic hidden Markov model (phylo-HMM) -> considers 1) the process by which nucleotide substitutions occur and 2) how this process changes from one site to the next
7Two-State Phylo-HMM
8Phylo-HMM Model
- Q: 4x4 substitution rate matrix
- π: background probabilities of A, G, C, T
- τ: binary tree representing the topology of the phylogeny
- β: branch lengths of the phylogeny
- ρ: scaling factor applied to the branch lengths in the conserved state
- Free parameters to be estimated
- μ, ν: transition probabilities
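To make the role of the scaling factor concrete, here is a small sketch of how shrinking the branch lengths changes substitution probabilities. It swaps in the Jukes-Cantor model, which has a closed form, in place of the general rate matrix Q on the slide; the branch length and the value of `rho` are hypothetical.

```python
import math

def jc_prob(same, t):
    """Jukes-Cantor probability that a base is unchanged (same=True) or has
    become one specific other base (same=False) after branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

def state_prob(same, t, rho=1.0):
    """Same probability with the branch length multiplied by rho."""
    return jc_prob(same, rho * t)

t = 0.5      # hypothetical branch length (expected substitutions per site)
rho = 0.3    # hypothetical scaling factor for the conserved state

p_nonconserved = state_prob(True, t)       # unscaled branch
p_conserved = state_prob(True, t, rho)     # shortened branch
print(p_nonconserved, p_conserved)
```

With the shorter effective branch, a base is more likely to be unchanged, so columns with identical bases are better explained by the conserved state.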
9Phylo-HMM
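- Let X = (X_1, ..., X_L) be the multiple alignment, with X_i its i-th column
- Each state has an emission probability for a column X_i
- Ancestral bases associated with X_i are unknown
- Marginal probabilities are obtained by summing over all ancestral bases
- Let a path through the phylo-HMM be a sequence of states, one per column
- The likelihood of the alignment must sum over all paths.
- Joint probability of the data and a specific path
- Parameters are estimated by expectation maximization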
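The marginalization over unknown ancestral bases can be sketched with Felsenstein's pruning recursion. This is a minimal illustration, not phastCons code: it assumes a Jukes-Cantor substitution model with uniform base frequencies, and the three-species tree and its branch lengths are hypothetical.

```python
import math

BASES = "ACGT"

def jc(a, b, t):
    """Jukes-Cantor probability of base a becoming base b over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def partial(node, column):
    """Return {base: P(leaves below node | node carries that base)}."""
    if isinstance(node, str):  # leaf, identified by species name
        observed = column[node]
        # A gap is treated as missing data: it is compatible with every base.
        return {b: 1.0 if observed in (b, "-") else 0.0 for b in BASES}
    out = {b: 1.0 for b in BASES}
    for child, t in node:      # internal node: list of (child, branch length)
        below = partial(child, column)
        for b in BASES:
            # Sum over the unknown base at the child end of the branch.
            out[b] *= sum(jc(b, c, t) * below[c] for c in BASES)
    return out

def column_likelihood(tree, column):
    """Probability of one alignment column, summed over all ancestral bases."""
    root = partial(tree, column)
    return sum(0.25 * root[b] for b in BASES)

# Hypothetical tree: ((human:0.1, mouse:0.3):0.1, chicken:0.5)
tree = [([("human", 0.1), ("mouse", 0.3)], 0.1), ("chicken", 0.5)]
same = column_likelihood(tree, {"human": "A", "mouse": "A", "chicken": "A"})
mixed = column_likelihood(tree, {"human": "A", "mouse": "C", "chicken": "G"})
print(same, mixed)
```

In a two-state phylo-HMM, a function like this would be evaluated twice per column, once with each state's branch lengths, to give the two emission probabilities.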
10Topology of phylogeny
11Genomes surveyed
- Five vertebrate species
- Human, mouse, rat, chicken, and fugu
- Four insect species
- Three species of Drosophila and Anopheles gambiae
- Two worm species
- Seven yeast species
12Calibration
- Constrained the model parameters such that the coverage of known coding regions by predicted conserved elements was equivalent across the data sets.
- Target coverage of 65% (±1%)
- Estimated from human/mouse comparison
- Resulting coverage
- Worm: 56%
- Yeast: 68%
- Insect: 68%
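The quantity being calibrated, coverage of coding regions by predicted elements, can be computed from two interval lists. The sketch below is only an illustration of that measurement; the function name `covered_fraction` and the coordinates are hypothetical, and intervals are assumed to be half-open and non-overlapping within each list.

```python
def covered_fraction(targets, elements):
    """Fraction of bases in `targets` that fall inside any of `elements`.

    Both arguments are lists of half-open (start, end) intervals on one
    chromosome; intervals within each list are assumed not to overlap.
    """
    total = sum(end - start for start, end in targets)
    overlap = 0
    for t_start, t_end in targets:
        for e_start, e_end in elements:
            overlap += max(0, min(t_end, e_end) - max(t_start, e_start))
    return overlap / total

# Hypothetical coordinates: two coding exons and two predicted elements.
cds = [(100, 200), (300, 400)]
predicted = [(150, 250), (290, 330)]
coverage = covered_fraction(cds, predicted)
print(coverage)
```

A calibration loop would adjust the model parameters and rerun the prediction until this fraction reaches the target.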
13phastCons result
14Categories of elements
15Vertebrate
- CDS: 66% coverage (15.5-fold)
- 5' UTR: 23% (5.3-fold)
- 3' UTR: 18% (4.3-fold)
- Other mRNA: 9.2% (2.1-fold)
- Other trans: 7.5% (1.8-fold)
- Intron: 3.6%
- Unannotated: 2.7%
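One way to read the fold values is as coverage divided by a genome-wide baseline. That reading is an assumption, not something stated on the slide, but dividing each coverage by its fold value gives a nearly constant baseline, as this short check shows.

```python
# (coverage in percent, fold enrichment) as quoted above.
vertebrate = {
    "CDS": (66.0, 15.5),
    "5' UTR": (23.0, 5.3),
    "3' UTR": (18.0, 4.3),
    "Other mRNA": (9.2, 2.1),
    "Other trans": (7.5, 1.8),
}

# Assumption: fold = category coverage / genome-wide coverage,
# so the implied genome-wide coverage is coverage / fold.
implied_baseline = {name: pct / fold for name, (pct, fold) in vertebrate.items()}

for name, value in implied_baseline.items():
    print(f"{name}: {value:.2f}%")
```

Under this reading, the intron (3.6%) and unannotated (2.7%) coverages sit below the implied baseline, consistent with those categories being depleted in conserved elements.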
16Vertebrate
- 56% coverage of putative RNA genes -> reasonably sensitive for functional noncoding sequences
- <1% of the bases in mammalian ancestral repeats (ARs), which are believed to be neutrally evolving -> suggests a low false-positive rate for prediction
- Simulation experiments showed a false-positive rate of <0.3%
17Insect, Worm, and Yeast
[Figure panels: Insect, Worm, Yeast]
- CDS, UTR, other mRNA -> substantially higher coverage
- Intron, unannotated -> lower than expected
18Composition of CE by Annotation
- From vertebrate to yeast (decreasing order of genome size and complexity): a larger fraction of conserved bases in CDS and UTR, a smaller fraction in introns and unannotated regions
- Majority of conserved bases in worm and yeast -> protein coding
- Majority of conserved bases in vertebrate and insect -> non-protein coding
- A consequence of increasing gene density?
- But this should not lead us to understate the functional importance of noncoding regions in eukaryotes
19Lengths of conserved elements
- The lengths of predicted elements for all four data sets were approximately geometrically distributed, with mean lengths of:
- 103.8 bp for the vertebrates
- 120.6 bp for the insects
- 268.8 bp for the worms
- 99.6 bp for the yeasts
- The differences -> reflect available phylogenetic information rather than biological characteristics
- More phylogenetic information at each site ->
- detection of shorter elements
- long elements more likely to be broken up
- -> overall decrease in the average lengths of predicted elements
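A geometric length distribution is what a two-state model produces when an element ends with a constant probability at each base. The sketch below, an illustration rather than the paper's analysis, converts each quoted mean length into that implied per-base ending probability and checks one case by simulation; the name `leave_prob` is hypothetical.

```python
import random

# Mean element lengths (bp) quoted on the slide.
mean_length = {"vertebrate": 103.8, "insect": 120.6, "worm": 268.8, "yeast": 99.6}

# Assumption: a geometric distribution with mean m corresponds to a constant
# per-base probability 1/m of ending the element.
leave_prob = {group: 1.0 / m for group, m in mean_length.items()}

def sample_length(p_leave, rng):
    """Draw one element length by extending base by base until it ends."""
    length = 1
    while rng.random() >= p_leave:  # continue with probability 1 - p_leave
        length += 1
    return length

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_length(leave_prob["vertebrate"], rng) for _ in range(20000)]
simulated_mean = sum(samples) / len(samples)
print(leave_prob)
print(simulated_mean)
```

The implied ending probabilities are all on the order of 1 in 100 to 1 in 270 per base, so small changes in this one parameter shift the mean element length substantially.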
20Lengths of conserved elements
- Average length in different annotation types
- ARs < introns and unannotated < UTR < CDS
21Base-by-base conservation score
- A log-odds score is assigned to each predicted element.
- UCSC Genome Browser
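Both kinds of score can be sketched for a two-state HMM. In the example below the per-base score is taken to be the posterior probability of the conserved state, computed with the forward-backward algorithm, and the element score is a sum of per-column log-odds. The emission values and the transition probabilities `mu` and `nu` are hypothetical inputs, not values from the paper.

```python
import math

def posterior_conserved(emit_c, emit_n, mu, nu):
    """Posterior probability of the conserved state at each column.

    emit_c[i], emit_n[i]: likelihood of column i under each state.
    mu: P(conserved -> nonconserved); nu: P(nonconserved -> conserved).
    """
    n = len(emit_c)
    states = "cn"
    emit = {"c": emit_c, "n": emit_n}
    trans = {("c", "c"): 1 - mu, ("c", "n"): mu,
             ("n", "c"): nu, ("n", "n"): 1 - nu}
    start = {"c": nu / (mu + nu), "n": mu / (mu + nu)}  # stationary distribution

    # Forward pass, rescaled at each column to avoid underflow.
    fwd = []
    for i in range(n):
        cur = {}
        for s in states:
            prior = start[s] if i == 0 else sum(fwd[-1][r] * trans[(r, s)] for r in states)
            cur[s] = prior * emit[s][i]
        norm = sum(cur.values())
        fwd.append({s: v / norm for s, v in cur.items()})

    # Backward pass, rescaled the same way.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for i in range(n - 2, -1, -1):
        cur = {s: sum(trans[(s, r)] * emit[r][i + 1] * bwd[i + 1][r] for r in states)
               for s in states}
        norm = sum(cur.values())
        bwd[i] = {s: v / norm for s, v in cur.items()}

    # Per-column scale factors cancel when the two states are normalized.
    post = []
    for i in range(n):
        c = fwd[i]["c"] * bwd[i]["c"]
        nc = fwd[i]["n"] * bwd[i]["n"]
        post.append(c / (c + nc))
    return post

def log_odds(emit_c, emit_n, start, end):
    """Log-odds score of columns start..end-1: conserved vs nonconserved."""
    return sum(math.log(emit_c[i] / emit_n[i]) for i in range(start, end))

# Hypothetical column likelihoods: columns 2-5 look conserved.
emit_c = [0.005, 0.005, 0.5, 0.5, 0.5, 0.5, 0.005, 0.005]
emit_n = [0.05] * 8
post = posterior_conserved(emit_c, emit_n, mu=0.1, nu=0.05)
score = log_odds(emit_c, emit_n, 2, 6)
print([round(p, 3) for p in post], round(score, 2))
```

The posterior gives a smooth per-base track suitable for display in a browser, while the log-odds sum gives a single number per element that can be used for ranking.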
22Highly Conserved Elements (HCEs)
- Top 5000 CEs (vertebrate, insect) or top 1000 CEs (worm, yeast)
- Similar to UCEs (Bejerano et al. 2004) but
- Longer than UCEs
- Less extreme sequence conservation
- The set is 10-fold larger than the set of UCEs
- 80% of human/rodent UCEs are included in HCEs (the other 20% tend to be short, with a mean length of 231 bp)
- Vertebrate HCEs cover 0.14% of the human genome
- Longer on average (781 bp)
- Enriched in CDS (22-fold) and UTR (11-fold), but only 42% are in known exons.
- 19% in known introns
- 32% in completely unannotated regions
23Coverage of vertebrate CE
24HCEs in other sets of genomes
25HCEs in 3' UTR of vertebrate genes
- Unexpectedly large fraction in 3' UTR (overall 5.6%)
- 9.6% in top 5K, 12.5% in top 1K, 14.3% in top 100.
- 5' UTR: 1.1% -> 1.5%, and absent in top 100
- Some extreme conservation (3' UTR)
- DNA- and RNA-binding genes
- 3' UTRs: key role in critical regulatory networks
- Many conserved 3' UTR sequences -> involved in post-transcriptional regulation -> sub-cellular localization, transcript stability, or translatability
- miRNA regulation
- Regulation through antisense transcription
26Secondary structure in HCEs
- Several known cases of post-transcriptional regulation through stem-loop structures in UTRs
- Tested in UTRs for secondary structure
- Used a phylo-SCFG to calculate FPS (folding potential score)
27FPS (Folding Potential Score)
[Figure: FPS of HCEs in introns, 3' UTRs, 5' UTRs, and intergenic regions]
-> Many intronic/intergenic HCEs may function at the RNA level
28Functional Enrichment - vertebrate
- Trans-dev: transcription- and development-related
- 3' UTR: highly enriched in mRNA metabolism, mRNA processing, ubiquitin cycle -> possible role in post-transcriptional regulation
29Functional Enrichment others (CDS)
30HCEs and gene desert
- Gene deserts: unusually large intergenic regions; 545 recently analyzed by Ovcharenko et al.
- Min length 640 kb, covering 25% of the human genome
- Low GC content
- High SNP rate
- Decreased fractions of conservation
- Two types
- Stable deserts: higher conservation, flanking genes enriched in trans-dev, absence of synteny breaks
- Only 12% of intergenic bases, but 53% of intergenic HCEs
- Variable deserts
- 30% of intergenic bases, but only 2.2% of intergenic HCEs
- HCEs are substantially enriched in stable deserts
- May be distal cis-regulatory elements.
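The enrichment in stable deserts follows directly from the two pairs of percentages above. As a rough worked check, assuming enrichment is defined as the share of intergenic HCEs divided by the share of intergenic bases:

```python
# (share of intergenic bases, share of intergenic HCEs), in percent, from the slide.
deserts = {"stable": (12.0, 53.0), "variable": (30.0, 2.2)}

# Assumed definition: a value of 1.0 would mean HCEs are spread in proportion
# to the number of intergenic bases.
enrichment = {kind: hce / bases for kind, (bases, hce) in deserts.items()}

for kind, value in enrichment.items():
    print(f"{kind} deserts: {value:.2f}-fold")
```

Under that definition, stable deserts hold several times more HCEs than their size predicts, while variable deserts hold far fewer.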
31Discussion
- Genome-wide search for conserved elements
- Phylogenetic HMM
- Functional categories of CEs
- HCEs similar to UCEs
- Secondary structure
- Functional enrichment
32Limitations
- CEs across species are dependent on calibration
- Assumes coding regions evolve in fundamentally similar ways
- But there are differences between groups.
- Sensitivity and specificity depend on the number of species, their phylogeny, and the amount of missing data.
- Phylo-HMM assumptions
- All sites evolve at one of two evolutionary rates.
- These rates are uniform across the genome.
- Sites evolve independently, conditional on the conserved/nonconserved state (C/N).
- The phylogenetic models for the two states have the same branch-length proportions, base compositions, and substitution patterns
- Gaps are considered as missing data
- -> Over-simplification.
- Complete dependency on alignment by MULTIZ
33Summary-I
- Conserved elements predicted by phylo-HMM in four separate genome-wide alignments
- From yeast to vertebrate, increasing fractions of CEs are outside of known exons or protein-coding genes -> importance of non-coding sequences
- HCEs in known exons: vertebrate -> 42%, others -> >93%
34Summary-II
- Some extreme conservation in the 3' UTRs of genes that regulate other genes -> possibly post-transcriptional regulation
- Local secondary structure in 3' UTRs
- -> possibly post-transcriptional regulation
- HCEs strongly enriched in stable gene deserts.
- -> Distal cis-regulatory elements