Title: Special Topics in Genomics: Cis-regulatory Modules and Phylogenetic Footprinting
1Special Topics in Genomics: Cis-regulatory
Modules and Phylogenetic Footprinting
2Cis-regulatory Modules and Module Discovery
The slides for module discovery are provided by
Prof. Qing Zhou at UCLA
3Motif Discovery
Mixture modeling
4Difficulties in motif discovery in higher
organisms
- Upstream sequences are longer.
- Motifs are less conserved and shorter.
- Background sequence structures are more complicated.
- To solve the problem, utilize more biological knowledge in our model:
- 1) module structure
- 2) multiple species conservation
5Cis-regulatory module
- Combinatorial control of genes: cis-regulatory modules
6CisModule: modeling module structure (Zhou and
Wong, PNAS 2004)
- Module structure: consider co-localization of motif sites.
Hierarchical mixture modeling (K: number of motifs)
7Parameters and missing data
- Missing data problem.
- K: number of motifs
- l: module length
- S: set of sequences
- M: indicators for a module start
- A: indicators for a motif site start
- Background model
- Weight matrices for motifs
- W: motif widths
- r: probability of a module start
- q: probability of starting a motif site
Given: K, l
Observed data: S
Missing data: M, A
Parameters: background model, weight matrices, W, r, q
8Bayesian inference by posterior sampling
9Module sampling
- Want to sample from P(M | S, parameters), need to
calculate
10Module sampling
11Posterior inference
- Motif sites: marginal posterior probability of being a motif start position > 0.5.
- Modules: marginal posterior probability of being within a module > 0.5.
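The 0.5 cutoff above can be illustrated with a minimal sketch. It assumes the posterior sampler has already produced a list of 0/1 indicator vectors (one per retained iteration); the function names and the toy samples are hypothetical and are not taken from CisModule itself.

```python
from typing import List

def marginal_posterior(samples: List[List[int]]) -> List[float]:
    """Fraction of retained sampler iterations in which each indicator equals 1."""
    n = len(samples)
    length = len(samples[0])
    return [sum(s[i] for s in samples) / n for i in range(length)]

def call_positions(samples: List[List[int]], cutoff: float = 0.5) -> List[int]:
    """Positions whose estimated marginal posterior probability exceeds the cutoff."""
    probs = marginal_posterior(samples)
    return [i for i, p in enumerate(probs) if p > cutoff]

# Toy input (made up): 4 retained iterations over a sequence of length 6.
motif_start_samples = [
    [0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
predicted_starts = call_positions(motif_start_samples)
```

The same function applies to the "within a module" indicators; only the input vectors change.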
12Simulation study
- Generate 30 data sets independently, each containing:
- 1) 20 sequences, each of length 1000
- 2) 25 modules, each of length 150
- 3) each module contains 1 E2F site, 1 YY1 site, and 1 cMyc site.
         CisModule              Do not consider module
Motifs   Fail   TP     FP       Fail   TP     FP
E2F      0.03   17.9   7.5      0.37   17.1   11.6
YY1      0.07   16.0   8.7      0.20   17.1   11.0
cMyc     0      15.7   9.9      0.63   13.6   12.4
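One way to generate a data set with the dimensions listed for the simulation study is sketched below. Several choices are assumptions made only for illustration: the background is uniform i.i.d., each motif is planted as a fixed consensus string rather than sampled from a weight matrix, the three sites appear in a fixed order inside a module, and modules are not allowed to overlap. The consensus strings and the function name are hypothetical stand-ins, not the matrices used in the study.

```python
import random

# Illustrative consensus strings (assumptions, not the study's weight matrices).
MOTIFS = {"E2F": "TTTCGCGC", "YY1": "GCCATCTT", "cMyc": "CACGTG"}

def simulate_dataset(n_seqs=20, seq_len=1000, n_modules=25, module_len=150, seed=0):
    rng = random.Random(seed)
    # Uniform random background (assumption).
    seqs = [[rng.choice("ACGT") for _ in range(seq_len)] for _ in range(n_seqs)]
    modules = []  # list of (sequence index, module start)
    while len(modules) < n_modules:
        s = rng.randrange(n_seqs)
        start = rng.randrange(seq_len - module_len + 1)
        # Reject placements that overlap an existing module in the same sequence.
        if any(s == s2 and abs(start - st2) < module_len for s2, st2 in modules):
            continue
        modules.append((s, start))
        # Split the module into three equal slots and plant one site per slot.
        slot = module_len // len(MOTIFS)
        for k, site in enumerate(MOTIFS.values()):
            offset = start + k * slot + rng.randrange(slot - len(site) + 1)
            seqs[s][offset:offset + len(site)] = site
    return ["".join(x) for x in seqs], modules

seqs, modules = simulate_dataset()
```

Calling `simulate_dataset` with 30 different seeds would give 30 independent data sets of this shape.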
13Example: Discovery of tissue-specific modules in
Ciona
- Sidow lab: collected 21 genes that are co-expressed during the development of muscle tissue in Ciona.
- Want to find motifs and modules in the upstream sequences (average length 1330) of these genes.
- Found 3 motifs in 28 modules (4860 bps).
Are they real motifs that determine the gene expression?
14Experimental validation
- Positive element: the shortest sufficient and non-overlapping sequence that drives strong expression in muscle; average length of 289 bps.
15Experimental validation
- 70% of our predicted motif sites are located in
the positive elements!
16Other tools
- Gibbs Module Sampler (Thompson et al. Genome Res. 2004)
- EMCMODULE (Gupta and Liu, PNAS, 2005)
17Phylogenetic Footprinting
18Functional elements tend to be conserved across
species
For example, exons are conserved due to
selection pressure. Introns and intergenic
regions are less likely to be conserved.
19Phylogenetic footprinting
Miller et al. Annu. Rev. Genomics Hum. Genet. 2004
20Incorporating cross-species conservation into
motif discovery
- A threshold method (Wasserman et al. Nature Genetics, 2000)
- STEP1: construct cross-species alignment
- STEP2: compute conservation measure from the alignment
- STEP3: non-conserved regions are filtered out
- STEP4: Gibbs motif sampler is applied to conserved regions of the target genome
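STEP2 and STEP3 of the threshold method can be sketched as follows, assuming a pairwise alignment is already available as two equal-length gapped strings. The conservation measure used here (fraction of identical columns in a sliding window), the window size, the threshold, and the function names are all illustrative assumptions, not the settings of the cited method. The Gibbs sampler of STEP4 is not implemented.

```python
def percent_identity_windows(target_aln: str, other_aln: str, window: int = 5):
    """Identity fraction in a window centered on each alignment column."""
    assert len(target_aln) == len(other_aln)
    # A column counts as a match only if both bases agree and neither is a gap.
    match = [int(a == b and a != "-") for a, b in zip(target_aln, other_aln)]
    half = window // 2
    scores = []
    for i in range(len(match)):
        lo, hi = max(0, i - half), min(len(match), i + half + 1)
        scores.append(sum(match[lo:hi]) / (hi - lo))
    return scores

def mask_nonconserved(target_aln: str, other_aln: str,
                      window: int = 5, threshold: float = 0.7) -> str:
    """Ungapped target sequence with non-conserved bases replaced by N."""
    scores = percent_identity_windows(target_aln, other_aln, window)
    out = []
    for base, score in zip(target_aln, scores):
        if base == "-":
            continue  # gap in the target: no target base at this column
        out.append(base if score >= threshold else "N")
    return "".join(out)
```

The masked sequence can then be handed to any motif sampler that ignores `N` positions.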
21Phylogenetic footprinting motif discovery
- CompareProspector (Liu Y. et al. Genome Res. 2004)
- STEP1: construct cross-species alignment
- STEP2: compute conservation measure (window percent identity, WPID) from the alignment
- STEP3: multiply the likelihood ratio at a position by the corresponding WPID, thus likelihood landscape is changed to favor conserved sites
- STEP4: apply a Gibbs motif sampler based algorithm
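The reweighting in STEP3 can be illustrated with a small sketch. It assumes WPID values have already been computed for every position, and it weights each candidate site by the WPID at its start position; the slides do not say how a window maps to a site, so that mapping is an assumption. The function names and the toy weight matrix are hypothetical.

```python
def likelihood_ratio(site, pwm, background):
    """P(site | motif) / P(site | background) under independent positions."""
    ratio = 1.0
    for j, base in enumerate(site):
        ratio *= pwm[j][base] / background[base]
    return ratio

def wpid_weighted_scores(seq, wpid, pwm, background):
    """Likelihood ratio of each candidate site, multiplied by the WPID at its start."""
    w = len(pwm)
    return [wpid[i] * likelihood_ratio(seq[i:i + w], pwm, background)
            for i in range(len(seq) - w + 1)]

# Toy inputs (made up): a width-2 motif that prefers "AC".
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
scores = wpid_weighted_scores("ACAC", [1.0, 0.5, 0.2, 0.2], pwm, background)
```

In the toy input the two `AC` sites have the same likelihood ratio, but the one in the better-conserved window receives the higher score, which is the intended effect of the weighting.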
22Phylogenetic footprinting motif discovery
- Evolutionary model based approach
- EMnEM (Moses et al. 2004)
- PhyME (Sinha et al. 2004)
- PhyloGibbs (Siddharthan et al. 2005)
- Tree Sampler (Li and Wong, 2005)
23Incorporating cross-species conservation into
motif discovery
- PhyloCon (Wang and Stormo, Bioinformatics, 2003)
- STEP 1: construct alignment among orthologous sequences
- STEP 2: convert conserved regions into profiles
- STEP 3: use profiles in the first sequence as seeds
- STEP 4: find matches of each seed in the second sequence
- STEP 5: update seeds
- STEP 6: repeat step 2 and 3 for all sequences.
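As an illustration of STEP 2 only, the sketch below turns a gap-free block of aligned orthologous sequences into a column-wise frequency profile. The function name and the optional pseudocount are assumptions; PhyloCon's own profile comparison and seed-update steps are not reproduced here.

```python
from collections import Counter

def alignment_to_profile(aligned_seqs, pseudocount=0.0):
    """Column-wise base frequencies for a gap-free block of aligned sequences."""
    length = len(aligned_seqs[0])
    assert all(len(s) == length for s in aligned_seqs)
    total = len(aligned_seqs) + 4 * pseudocount
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs)
        profile.append({b: (counts.get(b, 0) + pseudocount) / total for b in "ACGT"})
    return profile

# Toy conserved block from three orthologous sequences (made up).
profile = alignment_to_profile(["ACGT", "ACGA", "ACGT"])
```

Each entry of `profile` is a dictionary of base frequencies for one alignment column, so a fully conserved column has a single base with frequency 1.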
24Phylogenetic footprinting module discovery
- MultiModule (Zhou and Wong, The Annals of Applied
Statistics, 2007)
25MultiModule
- Module structure of each sequence is modeled by an HMM.
- Couple HMMs via multiple alignment: aligned states are coupled and collapsed into one common state.
- Uncoupled states: similar to single species model.
- Coupled states: evolutionary model.
26Comparing with other methods
- Three data sets with experimental validation reported previously, which contain 9 known motifs with 152 validated sites.
- CompareProspector (Liu et al. 2004): conservation score
- PhyloCon (Wang and Stormo 2003): progressive alignment of profiles
- EMnEM (Moses et al. 2004): phylogenetic motif discovery
- CisModule (Zhou and Wong 2004): single-species module discovery.
27Comparing with other methods
                                      For correctly identified motifs by each method
Method             # known motifs     # predicted sites   # overlaps   Sensitivity (%)   Specificity (%)
                   identified
CompareProspector  7                  75                  36           24                48
PhyloCon           3                  50                  26           17                52
EMnEM              6                  130                 44           29                34
CisModule          5                  110                 35           23                32
MultiModule        8                  157                 79           52                50
# of known sites: 152
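The two percentage columns in the table above are consistent with sensitivity = overlaps / known sites and specificity = overlaps / predicted sites. These definitions are inferred from the numbers rather than stated on the slides, so the short check below should be read as a sanity check under that assumption.

```python
KNOWN_SITES = 152

# (predicted sites, overlaps) per method, copied from the table.
results = {
    "CompareProspector": (75, 36),
    "PhyloCon": (50, 26),
    "EMnEM": (130, 44),
    "CisModule": (110, 35),
    "MultiModule": (157, 79),
}

def sensitivity_specificity(predicted, overlaps, known=KNOWN_SITES):
    """Return (sensitivity %, specificity %) rounded to whole percent."""
    sensitivity = round(100 * overlaps / known)
    specificity = round(100 * overlaps / predicted)
    return sensitivity, specificity

percentages = {name: sensitivity_specificity(p, o) for name, (p, o) in results.items()}
```

Under this reading, "specificity" is the fraction of predicted sites that overlap a validated site.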