Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes


1
Evolutionarily conserved elements in vertebrate,
insect, worm, and yeast genomes
  • Junguk Hur
  • School of Informatics

Adam Siepel et al., Genome Research, 2005, 15:1034-1050
2
Keywords
  • Conserved elements
  • phastCons
  • Two-state Phylogenetic Hidden Markov Model
    (Phylo-HMM)
  • Functional categories of bases (elements) (CDS,
    UTR, other mRNA, intron, and unannotated)

3
Background-I
  • Many completed genome sequencing projects.
  • Q) How much of a vertebrate genome is directly
    functional, and exactly which regions?
  • More is known in model organisms, but much
    remains to be learned.
  • Methods for identifying sequences likely to be
    functional are of critical importance.
  • What's best? Look for sequences conserved across
    species.
  • Conservation results from negative (purifying)
    selection

4
Background-II
  • About 5% of the mammalian genome is under purifying
    selection (Smith et al. 2004 and rodent genome
    projects)
  • 1.5% → protein-coding genes
  • 3.5% → functional but not protein-coding ("dark
    matter")
  • Conserved-sequence detection has mostly used pairwise
    alignments and simple, percent-identity-based methods
  • VISTA (Mayor et al. 2000), PipMaker (Schwartz et
    al. 2000), zPicture (Ovcharenko et al. 2004),
    etc.
  • Some methods use multiple alignments to take phylogeny
    into account
  • Stojanovic et al. 1999; Boffelli et al. 2003;
    Margulies et al. 2003; Chapman et al. 2004;
    Cooper et al. 2004; Ovcharenko et al. 2005
  • Limitations: sliding windows; little use of the branch
    lengths of the phylogeny (which would allow for multiple
    substitutions per site)

5
Preview of results by phastCons
Fraction of each genome covered by predicted conserved elements:
  • Vertebrate: 3-8%
  • Insect: 37-53%
  • Worm: 18-37%
  • Yeast: 47-68%
6
phastCons
  • Identifies conserved elements in multiply
    aligned sequences.
  • Based on a phylogenetic hidden Markov model
    (phylo-HMM), which considers 1) the process by which
    nucleotide substitutions occur and 2) how this
    process changes from one site to the next

7
Two-State Phylo-HMM
8
Phylo-HMM Model
  • Q: 4x4 substitution rate matrix
  • π: background probabilities of A, C, G, T
  • τ: binary tree representing the topology of the
    phylogeny
  • β: branch lengths of the phylogeny
  • ρ: scaling factor (0 ≤ ρ ≤ 1) applied to the branch
    lengths in the conserved state
  • Free parameters to be estimated
  • μ, ν: transition probabilities between the two states
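
As a rough illustration of how the rate matrix, branch lengths, and scaling factor interact, the sketch below uses the Jukes-Cantor model as a stand-in for Q with uniform background frequencies. The slides do not say which substitution model is used, so that choice, the function name `jc69_prob`, and the numeric values are all illustrative assumptions. The conserved state sees every branch shortened by the scaling factor, so a base is more likely to remain unchanged.

```python
import math

def jc69_prob(same: bool, t: float) -> float:
    """Jukes-Cantor probability over a branch of length t (expected
    substitutions per site): the base is unchanged (same=True) or has
    become one specific other base (same=False)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

branch_length = 0.5  # illustrative value, not from the slides
rho = 0.3            # illustrative scaling factor, 0 <= rho <= 1

# Nonconserved state uses the branch as is; conserved state scales it by rho.
p_nonconserved = jc69_prob(True, branch_length)
p_conserved = jc69_prob(True, rho * branch_length)

print(f"P(no change), nonconserved state: {p_nonconserved:.3f}")
print(f"P(no change), conserved state:    {p_conserved:.3f}")
```

With a general Q the same quantities would come from the matrix exponential of Q times the branch length; the closed form above only holds for Jukes-Cantor.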

9
Phylo-HMM
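  • Let X = (X_1, ..., X_L) be a multiple alignment
    with L columns
  • Each state emits an alignment column X_i with an
    emission probability given by its phylogenetic model
  • Ancestral bases associated with X_i are unknown
  • Marginal probabilities are obtained by summing over
    all ancestral bases
  • A path through the phylo-HMM assigns one state to
    each column
  • The likelihood of the alignment must sum over all paths.
  • Joint probability of the data and a specific
    path
  • Parameters are estimated by expectation maximization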
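
The two summations on this slide can be sketched in a few lines: Felsenstein's pruning algorithm sums over the unknown ancestral bases to give a column's emission probability, and the forward-backward algorithm sums over all paths to give the posterior probability that each column is in the conserved state. This is a minimal sketch, not the phastCons implementation. It assumes a Jukes-Cantor model, a toy three-species tree with made-up branch lengths, hand-picked values for the scaling factor and transition probabilities (no EM fitting), a stationary start distribution, and no numerical rescaling, so it is only suitable for short alignments. All names are hypothetical.

```python
import math

BASES = "ACGT"

def jc69(a, b, t):
    """Jukes-Cantor probability of base a becoming base b over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def partial(node, column, scale):
    """Pruning step: P(observed leaves below node | base at node), per base.
    A leaf is a species name; an internal node is (left, t_left, right, t_right)."""
    if isinstance(node, str):
        return [1.0 if base == column[node] else 0.0 for base in BASES]
    left, t_left, right, t_right = node
    below_left = partial(left, column, scale)
    below_right = partial(right, column, scale)
    result = []
    for a in BASES:
        via_left = sum(jc69(a, b, scale * t_left) * below_left[i]
                       for i, b in enumerate(BASES))
        via_right = sum(jc69(a, b, scale * t_right) * below_right[i]
                        for i, b in enumerate(BASES))
        result.append(via_left * via_right)
    return result

def emission(tree, column, scale):
    """Column probability, summed over all ancestral bases (uniform root prior)."""
    return sum(0.25 * p for p in partial(tree, column, scale))

def posterior_conserved(tree, columns, rho, mu, nu):
    """Forward-backward posterior P(conserved | alignment) for every column.
    State 0 = conserved (branches scaled by rho), state 1 = nonconserved."""
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]
    start = [nu / (mu + nu), mu / (mu + nu)]  # stationary distribution (assumed)
    emit = [[emission(tree, c, rho), emission(tree, c, 1.0)] for c in columns]
    n = len(columns)
    fwd = [[start[s] * emit[0][s] for s in (0, 1)]]
    for i in range(1, n):
        fwd.append([emit[i][s] * sum(fwd[i - 1][r] * trans[r][s] for r in (0, 1))
                    for s in (0, 1)])
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
                  for s in (0, 1)]
    likelihood = sum(fwd[-1])  # sum over all paths
    return [fwd[i][0] * bwd[i][0] / likelihood for i in range(n)]

# Toy data: eight identical columns followed by two fully divergent ones.
TREE = (("human", 0.1, "mouse", 0.3), 0.2, "chicken", 0.6)
ROWS = {"human": "ACGTACGTAC", "mouse": "ACGTACGTGA", "chicken": "ACGTACGTTG"}
COLUMNS = [{sp: seq[i] for sp, seq in ROWS.items()} for i in range(10)]

POSTERIOR = posterior_conserved(TREE, COLUMNS, rho=0.3, mu=0.1, nu=0.05)
for i, p in enumerate(POSTERIOR):
    print(i, "".join(ROWS[sp][i] for sp in ROWS), f"{p:.3f}")
```

The identical columns should receive a higher conserved-state posterior than the divergent ones at the end of the toy alignment.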

10
Topology of phylogeny
11
Genomes surveyed
  • Five vertebrate species
  • Human, mouse, rat, chicken, and fugu
  • Four insect species
  • Three species of Drosophila and Anopheles gambiae
  • Two worm species
  • Seven yeast species

12
Calibration
  • Model parameters were constrained so that the coverage
    of known coding regions by predicted conserved
    elements was comparable across the four data sets.
  • Target coverage of 65% (±1%)
  • Estimated from human/mouse comparison
  • Resulting coverage
  • Worm: 56%
  • Yeast: 68%
  • Insect: 68%
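
The quantity being calibrated, coverage of coding regions by predicted elements, is a plain interval-overlap fraction. The sketch below shows one way to compute it; the half-open coordinates and the function names `merge` and `coverage` are assumptions made for illustration, and the toy intervals were chosen so that the result lands on the 65% target.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def coverage(targets, elements):
    """Fraction of target bases (e.g. coding regions) that fall inside
    predicted elements. Assumes targets is non-empty."""
    targets = merge(targets)
    elements = merge(elements)
    total = sum(end - start for start, end in targets)
    covered = 0
    for t_start, t_end in targets:
        for e_start, e_end in elements:
            covered += max(0, min(t_end, e_end) - max(t_start, e_start))
    return covered / total

coding = [(100, 200), (300, 400)]               # toy coding regions
predicted = [(150, 250), (290, 330), (320, 380)]  # toy conserved elements
print(f"coding coverage: {coverage(coding, predicted):.2f}")
```

Calibration would then adjust the model parameters until this fraction reaches the target for each data set.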

13
phastCons results
14
Categories of elements
15
Vertebrate
  • CDS: 66% coverage (15.5-fold)
  • 5' UTR: 23% (5.3-fold)
  • 3' UTR: 18% (4.3-fold)
  • Other mRNA: 9.2% (2.1-fold)
  • Other trans: 7.5% (1.8-fold)
  • Intron: 3.6%
  • Unannotated: 2.7%
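
If the fold values on this slide are read as the ratio of a category's coverage to the genome-wide fraction of bases in conserved elements (an interpretation, not something the slide states), each coverage/fold pair implies a genome-wide fraction. The short check below back-calculates it from the numbers on the slide; the dictionary name is hypothetical.

```python
# (coverage in %, fold enrichment) as given on the slide
coverage_and_fold = {
    "CDS": (66.0, 15.5),
    "5' UTR": (23.0, 5.3),
    "3' UTR": (18.0, 4.3),
    "Other mRNA": (9.2, 2.1),
    "Other trans": (7.5, 1.8),
}

# Implied genome-wide fraction in conserved elements = coverage / fold.
implied = {name: cov / fold for name, (cov, fold) in coverage_and_fold.items()}

for name, value in implied.items():
    print(f"{name:12s} implies {value:.2f}% of the genome in conserved elements")
```

All five categories imply a value between 4.1% and 4.4%, which supports reading the fold values this way.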
16
Vertebrate
  • 56% of putative RNA genes are covered → reasonably
    sensitive for functional noncoding sequences
  • <1% of the bases in mammalian ancestral repeats
    (ARs, believed to be neutrally evolving) are
    covered → suggests a low false-positive rate
  • Simulation experiments showed a false-positive
    rate of <0.3%
17
Insect, Worm, and Yeast
  • CDS, UTR, other mRNA → substantially higher
    coverage than expected
  • Intron, unannotated → lower coverage than
    expected
18
Composition of CE by Annotation
  • From vertebrate to yeast (decreasing genome size
    and complexity): a larger fraction of CEs in CDS
    and UTRs, a smaller fraction in introns and
    unannotated regions
  • Most conserved bases in worm and yeast →
    protein-coding
  • Most conserved bases in vertebrate and insect →
    non-protein-coding
  • A consequence of increasing gene density?
  • But this should not understate the functional
    importance of noncoding regions in eukaryotes
19
Lengths of conserved elements
  • The lengths of predicted elements for all four
    data sets were approximately geometrically
    distributed, with mean lengths of
  • 103.8 bp for the vertebrates
  • 120.6 bp for the insects
  • 268.8 bp for the worms
  • 99.6 bp for the yeasts.
  • The differences likely reflect the available
    phylogenetic information rather than biological
    characteristics
  • More phylogenetic information at each site →
    detection of shorter elements
  • Long elements are more likely to be broken up →
    overall decrease in the average length of
    predicted elements
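
A geometric length distribution is what a two-state HMM produces when the probability of leaving the conserved state is the same at every base. Under that assumption, the mean length is the reciprocal of the per-base exit probability. The sketch below treats the four lengths on the slide as means and converts them; the function names are hypothetical, and predicted elements only approximately follow this idealized distribution.

```python
mean_lengths_bp = {
    "vertebrate": 103.8,
    "insect": 120.6,
    "worm": 268.8,
    "yeast": 99.6,
}

def exit_probability(mean_length):
    """Per-base probability of leaving the conserved state, assuming
    lengths are geometric on 1, 2, 3, ... with the given mean."""
    return 1.0 / mean_length

def prob_at_least(length, mean_length):
    """P(element length >= length) under the same geometric assumption."""
    return (1.0 - exit_probability(mean_length)) ** (length - 1)

for group, mean in mean_lengths_bp.items():
    print(f"{group:10s} exit prob = {exit_probability(mean):.5f}, "
          f"P(length >= 500 bp) = {prob_at_least(500, mean):.4f}")
```

The worm data set, with the longest mean, has the smallest implied exit probability and the heaviest tail of long elements.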

20
Lengths of conserved elements
  • Average length in different annotation types
  • ARs < introns and unannotated < UTR < CDS

21
Base-by-base conservation score
  • A log-odds score is assigned to each predicted element.
  • UCSC Genome Browser
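
One common way to form a log-odds score for an element is to sum, over its columns, the log ratio of the column's probability under the conserved model to its probability under the nonconserved model. The sketch below shows that calculation with made-up per-column probabilities. Whether phastCons computes its element scores exactly this way is not stated in the slides, so treat the function `log_odds_score` as an illustrative assumption.

```python
import math

def log_odds_score(p_conserved, p_nonconserved):
    """Sum of per-column log ratios (natural log) between the two models."""
    return sum(math.log(c / n) for c, n in zip(p_conserved, p_nonconserved))

# Made-up emission probabilities for a four-column element.
under_conserved = [0.18, 0.17, 0.18, 0.02]
under_nonconserved = [0.09, 0.08, 0.09, 0.04]

score = log_odds_score(under_conserved, under_nonconserved)
print(f"log-odds score: {score:.3f}")
```

A positive score means the element's columns are, taken together, better explained by the conserved model; longer and more strongly conserved elements score higher.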

22
Highly Conserved Elements (HCEs)
  • Top-scoring 5000 CEs (vertebrate, insect) or 1000
    CEs (worm, yeast)
  • Similar to UCEs (Bejerano et al. 2004) but
  • Longer than UCEs
  • Less extreme sequence conservation
  • The set is 10-fold larger than the set of UCEs
  • 80% of human/rodent UCEs are included in HCEs
    (the other 20% tend to be short, with a mean length
    of 231 bp)
  • Vertebrate HCEs cover 0.14% of the human genome
  • Longer on average (781 bp)
  • Enriched in CDS (22-fold) and UTRs (11-fold), but
    42% are in known exons.
  • 19% in known introns
  • 32% in completely unannotated regions

23
Coverage of vertebrate CE
24
HCEs in other sets of genomes
25
HCEs in 3' UTRs of vertebrate genes
  • Unexpectedly large fraction in 3' UTRs (overall 5.6%)
  • 9.6% in the top 5000, 12.5% in the top 1000, 14.3%
    in the top 100.
  • 5' UTRs: 1.1% → 1.5%, and absent in the top 100
  • Some extreme conservation (3' UTR)
  • DNA- and RNA-binding genes
  • 3' UTRs play a key role in critical regulatory networks
  • Many conserved 3' UTR sequences are involved in
    post-transcriptional regulation → sub-cellular
    localization, transcript stability, or
    translatability.
  • miRNA regulation
  • Regulation through antisense transcription

26
Secondary structure in HCEs
  • Several known cases of post-transcriptional
    regulation act through stem-loop structures in UTRs
  • HCEs in UTRs were tested for secondary structure
  • Used a phylo-SCFG to calculate the FPS (folding
    potential score)

27
FPS (Folding Potential Score)
  • FPS compared across intronic, 3' UTR, 5' UTR, and
    intergenic HCEs
  • Many intronic/intergenic HCEs may function at
    the RNA level
28
Functional Enrichment - vertebrate
  • Trans-dev: transcription- and development-related
    genes
  • 3' UTR: highly enriched in mRNA metabolism, mRNA
    processing, and ubiquitin cycle → possible role in
    post-transcriptional regulation

29
Functional Enrichment - others (CDS)
30
HCEs and gene deserts
  • Unusually large intergenic regions: 545 were
    recently analyzed by Ovcharenko et al.
  • Minimum length 640 kb, covering 25% of the human genome
  • Low GC content
  • High SNP rate
  • Decreased fractions of conservation
  • Two types
  • Stable deserts: higher conservation, flanking genes
    enriched in trans-dev, absence of synteny breaks
  • Only 12% of intergenic bases, but 53% of intergenic
    HCEs
  • Variable deserts
  • 30% of intergenic bases, but only 2.2% of intergenic
    HCEs
  • HCEs are substantially enriched in stable deserts
  • May be distal cis-regulatory elements.
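
The enrichment claim follows directly from the percentages on this slide. Dividing each desert type's share of intergenic HCEs by its share of intergenic bases gives a rough enrichment ratio, where 1.0 would mean HCEs are spread evenly. The sketch below does that arithmetic; the variable names are hypothetical.

```python
# (% of intergenic bases, % of intergenic HCEs) from the slide
deserts = {
    "stable": (12.0, 53.0),
    "variable": (30.0, 2.2),
}

# Ratio > 1 means HCEs are over-represented relative to the bases covered.
enrichment = {name: hces / bases for name, (bases, hces) in deserts.items()}

for name, ratio in enrichment.items():
    print(f"{name:8s} deserts: {ratio:.2f}x")
```

Stable deserts come out at roughly 4.4 times the even-spread expectation, while variable deserts hold less than a tenth of it.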

31
Discussion
  • Genome-wide search for conserved elements
  • Phylogenetic HMM
  • Functional categories of CEs
  • HCEs similar to UCEs
  • Secondary structure
  • Functional enrichment

32
Limitations
  • Comparing CEs across species depends on the calibration
  • It assumes coding regions evolve in fundamentally
    similar ways
  • There are differences between groups.
  • Sensitivity and specificity depend on the number of
    species, their phylogeny, and the amount of missing data.
  • Phylo-HMM assumptions
  • All sites evolve at one of two evolutionary
    rates.
  • These rates are uniform across the genome.
  • Sites evolve independently, conditional on the
    state (conserved/nonconserved).
  • The phylogenetic models for both states share the
    same branch-length proportions, base compositions,
    and substitution patterns
  • Gaps are treated as missing data
  • → Oversimplification.
  • Complete dependency on the alignment by MULTIZ

33
Summary-I
  • Conserved elements predicted by a phylo-HMM in four
    separate genome-wide alignments
  • From yeast to vertebrate, increasing fractions of
    CEs are outside of known exons or protein-coding
    genes → importance of non-coding sequences
  • HCEs in known exons: 42% in vertebrates, >93% in
    the other groups

34
Summary-II
  • Some extreme conservation in the 3' UTRs of genes
    that regulate other genes → possibly
    post-transcriptional regulation
  • Local secondary structure in 3' UTRs
  • → possibly post-transcriptional regulation
  • HCEs are strongly enriched in stable gene deserts.
  • → Distal cis-regulatory elements

35
  • Thanks