Title: 10/26/05 Promoter Prediction (really!)
1 10/26/05 Promoter Prediction (really!)
2 Announcements
- BCB Link for Seminar Schedules (updated)
  - http://www.bcb.iastate.edu/seminars/index.html
- Seminar (Fri Oct 28)
  - 12:10 PM BCB Faculty Seminar in E164 Lagomarcino
  - Assembly and Alignment of Genomic DNA Sequence
  - Xiaoqiu Huang, ComS
  - http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2028
- Mark your calendars
  - 1:10 PM Nov 14 Baker Seminar in Howe Hall Auditorium
  - "Discovering transcription factor binding sites"
  - Douglas Brutlag, Dept of Biochemistry & Medicine
  - Stanford University School of Medicine
3 Announcements
BCB 544 Projects - Important Dates
- Nov 2 Wed noon - Project proposals due to David/Drena
- Nov 4 Fri 10A - Approvals/responses to students
- Dec 2 Fri noon - Written project reports due
- Dec 5,7,8,9 class/lab - Oral Presentations (20')
- (Dec 15 Thurs - Final Exam)
4 Announcements
- Lab 9 - due Wed noon (today)
- Exam 2 - this Friday
- Posted Online
  - Exam 2 Study Guide
  - 544 Reading Assignment (2 papers)
  - Lab Keys (today)
- Thurs: No Lab - Extra Office Hrs instead
  - David 1-3 PM in 209 Atanasoff
  - Drena 1-3 PM in 106 MBB
5 Promoter Prediction & RNA Structure/Function Prediction
- Mon: Quite a few more words re
  - Gene prediction
- Wed: Promoter prediction
- Next Mon: RNA structure & function
  - RNA structure prediction
  - 2' & 3' structure prediction
  - miRNA & target prediction
6 Optional - but very helpful reading (that's a hint!)
- Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709
  - http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html
- Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
  - http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
- Check this out: http://www.phylofoot.org/NRG_testcases/
7 Reading Assignment (for Mon)
- Mount Bioinformatics
  - Chp 8 Prediction of RNA Secondary Structure
  - pp. 327-355
  - Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
- Cates (Online) RNA Secondary Structure Prediction Module
  - http://cnx.rice.edu/content/m11065/latest/
8 Review last lecture
- Flowchart for Gene Prediction
- Performance Assessment Measures
- Correction re slide 10/24 #27
- Promoters
9 Gene prediction flowchart
Fig 5.15 Baxevanis & Ouellette 2005
10 Evaluation of Splice Site Prediction
What do measures really mean?
Fig 5.11 Baxevanis & Ouellette 2005
11 Correction re last lecture: GeneSeqer Performance Graphs
Brendel et al (2004) Bioinformatics 20: 1157
12 Performance?
[Figure: four panels (Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site), each with an Sn axis]
- Note: these are not ROC curves (plots of (1-Sn) vs Sp)
- But plots such as these (& ROCs) are much better than using a "single number" to compare different methods
- Both types of plots illustrate the trade-off between Sn and Sp
Brendel 2005
13 Fig 2 - Brendel et al (2004) Bioinformatics 20: 1157
14 Bayes Factor as Decision Criterion
H0 HT
Brendel 2005
15 Evaluation of Splice Site Prediction
Brendel 2005
16 Careful: different definitions for "Specificity"
Brendel definitions
cf. Guigó definitions
- Sn: Sensitivity = TP/(TP+FN)
- Sp: Specificity = TN/(TN+FP) Sp-
- AC: Approximate Coefficient = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1
Other measures? Predictive Values, Correlation Coefficient
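The measures on this slide can be computed directly from the four confusion counts. The sketch below is only an illustration: the function names and the example counts are made up, and `precision` is included because TP/(TP+FP) is the quantity that some gene-prediction papers label "specificity", which is presumably the source of the confusion the slide warns about.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity_tn(tn, fp):
    """Sp = TN / (TN + FP), the form written on the slide."""
    return tn / (tn + fp)


def precision(tp, fp):
    """TP / (TP + FP); called 'specificity' in some gene-prediction papers."""
    return tp / (tp + fp)


def approximate_coefficient(tp, tn, fp, fn):
    """AC = 0.5 * (TP/(TP+FN) + TP/(TP+FP) + TN/(TN+FP) + TN/(TN+FN)) - 1."""
    total = (tp / (tp + fn) + tp / (tp + fp)
             + tn / (tn + fp) + tn / (tn + fn))
    return 0.5 * total - 1


# Hypothetical counts for a splice-site predictor (not from the lecture).
tp, fn, fp, tn = 80, 20, 40, 860
print("Sn        =", sensitivity(tp, fn))
print("Sp (TN)   =", specificity_tn(tn, fp))
print("TP/(TP+FP)=", precision(tp, fp))
print("AC        =", approximate_coefficient(tp, tn, fp, fn))
```

With many more true negatives than true positives, TN/(TN+FP) stays high even when a large share of the predictions are false, which is one reason the two "specificity" definitions can tell very different stories.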
17 Best measures for comparing different methods?
- ROC curves (Receiver Operating Characteristic?!!)
  - http://www.anaesthetist.com/mnm/stats/roc/
  - "The Magnificent ROC" - has fun applets & quotes
  - "There is no statistical test, however intuitive and simple, which will not be abused by medical researchers"
- Correlation Coefficient
  - Matthews correlation coefficient (MCC)
  - MCC = 1 for a perfect prediction
  - 0 for a completely random assignment
  - -1 for a "perfectly incorrect" prediction
Do not memorize this!
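The three landmark MCC values listed above can be checked with a few lines of code. This is a minimal sketch using the standard MCC formula; returning 0.0 when a marginal total is zero is a convention assumed here, and the counts are invented for illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the denominator is zero.
    return numerator / denominator if denominator else 0.0


print(mcc(tp=50, tn=50, fp=0, fn=0))     # perfect prediction
print(mcc(tp=25, tn=25, fp=25, fn=25))   # no better than chance
print(mcc(tp=0, tn=0, fp=50, fn=50))     # "perfectly incorrect" prediction
```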
18 Promoters: What signals are there? Simple ones in prokaryotes
Brown Fig 9.17
BIOS Scientific Publishers Ltd, 1999
19 Prokaryotic promoters
- RNA polymerase complex recognizes promoter sequences located very close to & on 5' side (upstream) of initiation site
- RNA polymerase complex binds directly to these, with no requirement for transcription factors
- Prokaryotic promoter sequences are highly conserved
  - -10 region
  - -35 region
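Because the -35 and -10 regions are well conserved, a naive scan for them is easy to write. The sketch below assumes the commonly cited E. coli consensus hexamers TTGACA (-35) and TATAAT (-10) separated by a 16-19 bp spacer; those sequences, the mismatch tolerance, and the function names are assumptions for illustration and are not taken from the slide.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_like(seq, box35="TTGACA", box10="TATAAT",
                      spacers=(16, 17, 18, 19), max_mismatch=2):
    """Return (pos35, pos10, spacer) for each -35/-10 pair found.

    Positions are 0-based starts on the given strand; each box may
    differ from its consensus at up to `max_mismatch` positions.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        if mismatches(seq[i:i + len(box35)], box35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + len(box35) + gap
            candidate = seq[j:j + len(box10)]
            if (len(candidate) == len(box10)
                    and mismatches(candidate, box10) <= max_mismatch):
                hits.append((i, j, gap))
    return hits


# Toy sequence with both boxes planted 17 bp apart.
toy = "GGC" + "TTGACA" + "A" * 17 + "TATAAT" + "GCGC"
print(find_sigma70_like(toy, max_mismatch=0))
```

Real promoters rarely match both hexamers perfectly, so the mismatch tolerance trades sensitivity against the number of false positives.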
20 What signals are there? Complex ones in eukaryotes!
Fig 9.13 Mount 2004
21 Simpler view of complex promoters in eukaryotes
Fig 5.12 Baxevanis & Ouellette 2005
22 Eukaryotic genes are transcribed by 3 different RNA polymerases
Recognize different types of promoters & enhancers
Brown Fig 9.18
BIOS Scientific Publishers Ltd, 1999
23 Eukaryotic promoters & enhancers
- Promoters located relatively close to initiation site
  - (but can be located within gene, rather than upstream!)
- Enhancers also required for regulated transcription
  - (these control expression in specific cell types, developmental stages, in response to environment)
- RNA polymerase complexes do not specifically recognize promoter sequences directly
- Transcription factors bind first and serve as landmarks for recognition by RNA polymerase complexes
24 Eukaryotic transcription factors
- Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription
- TFs contain characteristic DNA binding motifs
  - http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039
- TFs recognize specific short DNA sequence motifs: transcription factor binding sites
- Several databases for these, e.g. TRANSFAC
  - http://www.generegulation.com/cgibin/pub/databases/transfac
25 Zinc finger-containing transcription factors
- Common in eukaryotic proteins
- Estimated 1% of mammalian genes encode zinc-finger proteins
- In C. elegans, there are 500!
- Can be used as highly specific DNA binding modules
- Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy
Brown Fig 9.12
BIOS Scientific Publishers Ltd, 1999
26 New Today: Promoter Prediction
- Predicting regulatory regions (focus on promoters)
  - Brief review: promoters & enhancers
  - Predicting promoters: eukaryotes vs prokaryotes
- Next week
  - RNA structure & function
27 Predicting Promoters
- Overview of strategies
- What sequence signals can be used?
- What other types of information can be used?
- Algorithms
- Promoter prediction software
- 3 major types
- many, many programs!
28 Promoter prediction: Eukaryotes vs prokaryotes
Promoter prediction is easier in microbial genomes
Why?
- Highly conserved
- Simpler gene structures
- More sequenced genomes! (for comparative approaches)
Methods?
- Previously, again mostly HMM-based
- Now: similarity-based, comparative methods, because so many genomes available
29 Predicting promoters: Steps & Strategies
- Closely related to gene prediction!
- Obtain genomic sequence
- Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  - But "regulatory" regions are much less well-conserved than coding regions
- Locate ORFs
- Identify TSS (if possible!)
- Use promoter prediction programs
- Analyze motifs, etc. in sequence (TRANSFAC)
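Once a TSS has been placed, the sequence handed to promoter prediction or motif analysis is usually the stretch just upstream of it. The sketch below shows one way to pull that region out for either strand; the function names, the 0-based coordinate convention, and the 200 bp default are assumptions for illustration.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def upstream_region(genome, tss, strand, length=200):
    """Return up to `length` bases immediately upstream of the TSS.

    `tss` is a 0-based index into `genome` (given as the + strand).
    The result is written 5'->3' on the strand of the gene, so the
    last base returned is the one adjacent to the TSS.
    """
    if strand == "+":
        start = max(0, tss - length)
        return genome[start:tss].upper()
    if strand == "-":
        end = min(len(genome), tss + 1 + length)
        return reverse_complement(genome[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


toy_genome = "ACGTACGTTTGA"
print(upstream_region(toy_genome, tss=6, strand="+", length=4))
print(upstream_region(toy_genome, tss=6, strand="-", length=4))
```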
30 Predicting promoters: Steps & Strategies
- Identify TSS - if possible?
  - One of biggest problems is determining exact TSS!
  - Not very many full-length cDNAs!
- Good starting point? (human & vertebrate genes)
  - Use FirstEF
    - found within UCSC Genome Browser
    - or submit to FirstEF web server
Fig 5.10 Baxevanis & Ouellette 2005
31 Automated promoter prediction strategies
- Pattern-driven algorithms
- Sequence-driven algorithms
- Combined "evidence-based"
- BEST RESULTS? Combined, sequential
32 Promoter Prediction: Pattern-driven algorithms
- Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)
- Tend to produce huge numbers of FPs
- Why?
  - Binding sites (BS) for specific TFs often variable
  - Binding sites are short (typically 5-15 bp)
  - Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
  - One binding site often recognized by multiple binding factors (BFs)
  - Biology is complex: promoters often specific to organism/cell/stage/environmental condition
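A quick back-of-the-envelope calculation shows why short binding sites alone generate so many false positives. The sketch below assumes an exact-match motif and a random sequence with equal, independent base frequencies; both are simplifying assumptions, and the function name is made up for illustration.

```python
def expected_chance_hits(motif_length, sequence_length, both_strands=True):
    """Expected number of exact matches to one fixed motif by chance.

    Assumes each base occurs independently with probability 0.25.
    """
    positions = max(sequence_length - motif_length + 1, 0)
    per_position = 0.25 ** motif_length
    strands = 2 if both_strands else 1
    return positions * per_position * strands


# Chance matches expected in 1 Mb of sequence for the 5-15 bp range.
for k in (5, 10, 15):
    print(k, "bp:", expected_chance_hits(k, 1_000_000))
```

Allowing mismatches or degenerate positions, as real binding-site models do, raises these numbers further.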
33 Promoter Prediction: Pattern-driven algorithms
- Solutions to problem of too many FP predictions?
  - Take sequence context/biology into account
    - Eukaryotes: clusters of TFBSs are common
    - Prokaryotes: knowledge of σ factors helps
  - Probability of "real" binding site increases if annotated transcription start site (TSS) nearby
    - But: What about enhancers? (no TSS nearby!)
    - Only a small fraction of TSSs have been experimentally mapped
  - Do the wet lab experiments!
    - But: Promoter-bashing is tedious
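One way to use the "clusters of TFBSs" idea is to keep only predicted sites that fall close together. The following is a minimal sketch of that filter, assuming the input is simply a list of predicted site positions; the window size, the minimum site count, and the function name are illustrative choices rather than values from the lecture.

```python
def find_site_clusters(positions, window=200, min_sites=3):
    """Return (start, end) spans where at least `min_sites` predicted
    sites fall within `window` bp of each other.

    Overlapping spans are merged into a single cluster.
    """
    pos = sorted(positions)
    clusters = []
    left = 0
    for right in range(len(pos)):
        # Shrink from the left until the span fits inside the window.
        while pos[right] - pos[left] > window:
            left += 1
        if right - left + 1 >= min_sites:
            start, end = pos[left], pos[right]
            if clusters and start <= clusters[-1][1]:
                clusters[-1] = (clusters[-1][0], end)  # extend previous cluster
            else:
                clusters.append((start, end))
    return clusters


# Hypothetical predicted site positions along one sequence.
predicted = [100, 150, 220, 5000, 9000, 9050, 9120, 9180]
print(find_site_clusters(predicted))
```

Isolated hits such as the one at position 5000 are dropped, which reduces false positives at the cost of missing genuine single-site elements.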
34 Promoter Prediction: Sequence-driven algorithms
- Assumption: common functionality can be deduced from sequence conservation
- Alignments of co-regulated genes should highlight elements involved in regulation
- Careful: How determine co-regulation?
  - Orthologous genes from different species
  - Genes experimentally determined to be co-regulated (using microarrays??)
- Comparative promoter prediction
  - "Phylogenetic footprinting" - more later.
35 Promoter Prediction: Sequence-driven algorithms
- Problems
  - Need sets of co-regulated genes
  - For comparative (phylogenetic) methods
    - Must choose appropriate species
    - Different genomes evolve at different rates
    - Classical alignment methods have trouble with translocations, inversions in order of functional elements
    - If background conservation of entire region is high, comparison is useless
  - Not enough data (Prokaryotes >>> Eukaryotes)
  - Biology is complex: many (most?) regulatory elements are not conserved across species!
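The core idea behind comparative methods, and the background-conservation problem noted above, can be seen by computing percent identity in sliding windows across a pairwise alignment. This is only a sketch under simple assumptions (a precomputed gapped alignment, fixed window, identity as the conservation score); actual phylogenetic footprinting tools are more sophisticated, and the function names are made up.

```python
def window_identity(aln_a, aln_b, window=10, step=1):
    """Fraction of identical, non-gap columns in each alignment window.

    `aln_a` and `aln_b` are two rows of a pairwise alignment, with
    '-' marking gaps. Returns a list of (window_start, identity).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        cols = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(x == y and x != "-" for x, y in cols)
        results.append((start, matches / window))
    return results


def conserved_windows(aln_a, aln_b, window=10, min_identity=0.9):
    """Window start positions whose identity is at least `min_identity`."""
    return [start for start, identity in window_identity(aln_a, aln_b, window)
            if identity >= min_identity]


# Toy alignment: only the TATAAA core is shared.
seq_a = "ACGTTATAAAGGCC"
seq_b = "TGCATATAAATTAA"
print(conserved_windows(seq_a, seq_b, window=6, min_identity=1.0))
```

If the two species are too closely related, nearly every window passes the threshold and the conserved element no longer stands out.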
36 Examples of promoter prediction/characterization software
Lab: used
- MATCH, MatInspector
- TRANSFAC
- MEME & MAST
- BLAST, etc.
Others?
- FIRST EF
- Dragon Promoter Finder
(these are links in PPTs)
- also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc)
- JASPAR
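As a rough illustration of what "finding CpG islands" involves, the sketch below computes GC content and the observed/expected CpG ratio for a sequence and applies the widely used thresholds of length at least 200 bp, GC content above 50%, and observed/expected CpG above 0.6. Those thresholds and the function names are assumptions for illustration; this is not a description of how the Dragon tools work.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N.
    if c_count and g_count:
        obs_exp = cpg_count * n / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply length, GC-content and obs/exp CpG thresholds to one window."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_length and gc_fraction > min_gc and obs_exp > min_obs_exp


print(looks_like_cpg_island("CG" * 100))             # CpG-rich toy sequence
print(looks_like_cpg_island("C" * 100 + "G" * 100))  # GC-rich but CpG-poor
```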
37 TRANSFAC matrix entry for TATA box
- Fields
- Accession ID
- Brief description
- TFs associated with this entry
- Weight matrix
- Number of sites used to build (How many here?)
- Other info
Fig 5.13 Baxevanis Ouellette 2005
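To make the "weight matrix" and "number of sites used to build" fields concrete, here is a minimal sketch that builds a count matrix from aligned sites, converts it to log-odds scores, and scans a sequence. The four TATA-like sites are invented for illustration and are not the TRANSFAC entry; the pseudocount, the uniform background, and the function names are likewise assumptions.

```python
import math


def count_matrix(sites):
    """Per-position base counts from equal-length aligned binding sites."""
    length = len(sites[0])
    counts = [{base: 0 for base in "ACGT"} for _ in range(length)]
    for site in sites:
        for i, base in enumerate(site.upper()):
            counts[i][base] += 1
    return counts


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """log2(frequency / background) per base and position, with pseudocounts."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({base: math.log2((column[base] + pseudocount) / total / background)
                       for base in "ACGT"})
    return matrix


def best_hit(seq, matrix):
    """Return (score, position) of the highest-scoring window in `seq`."""
    seq = seq.upper()
    width = len(matrix)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the matrix")
    scored = [(sum(matrix[i][seq[pos + i]] for i in range(width)), pos)
              for pos in range(len(seq) - width + 1)]
    return max(scored)


# Hypothetical aligned sites (not taken from TRANSFAC).
sites = ["TATAAA", "TATATA", "TATAAG", "TATAAA"]
counts = count_matrix(sites)
print("sites used:", sum(counts[0].values()))
print(best_hit("GCGCTATAAAGGC", log_odds_matrix(counts)))
```

The column sums of the count matrix recover the number of sites the matrix was built from, which is the question the slide poses about the TATA box entry.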
38 Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)
Fig 5.14 Baxevanis & Ouellette 2005
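For readers who want to see what a global alignment of two short promoter fragments involves, the sketch below implements the textbook Needleman-Wunsch algorithm with a linear gap penalty. The scoring values and the toy sequences are arbitrary assumptions; the figure itself was presumably produced with a standard alignment program, not with this code.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two strings; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(row_a)
print(row_b)
```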
39 Check out optional review & try associated tutorial
- Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
  - http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
- Check this out: http://www.phylofoot.org/NRG_testcases/
40 Annotated lists of promoter databases & promoter prediction software
- URLs from Mount Chp 9, available online
  - Table 9.12: http://www.bioinformaticsonline.org/links/ch_09_t_2.html
- Table in Wasserman & Sandelin Nat Rev Genet article
  - http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
- URLs for Baxevanis & Ouellette, Chp 5
  - http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links
- More lists
  - http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
  - http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
  - http://www3.oup.co.uk/nar/database/subcat/1/4/
41 Reading Assignment (for Mon)
- Mount Bioinformatics
  - Chp 8 Prediction of RNA Secondary Structure
  - pp. 327-355
  - Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
- Cates (Online) RNA Secondary Structure Prediction Module
  - http://cnx.rice.edu/content/m11065/latest/