Title: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA)
1 Introduction to the Eukaryotic Promoter Database
(EPD) and Signal Search Analysis (SSA)
Workshop on Regulatory Sequence Motif
Discovery, November 10th 2006. The Linnaeus
Centre for Bioinformatics, SLU-UU, Sweden.
- Giovanna Ambrosini, Christoph Schmid
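2 Components of transcriptional regulation
[Figure: components of transcriptional regulation, including distal transcription-factor binding sites (enhancer) and cis-regulatory modules. Source: Wasserman and Sandelin, Nat Rev Genet 5, 276-287 (2004)]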
3 EPD: The Eukaryotic Promoter Database
Current Release 88 (SEPT-2006)
- founded in 1986 (Bucher and Trifonov, Nucleic Acids Res, 14, 10009-10026)
- originally exclusively based on literature, carefully maintained and regularly updated
- in recent years, mass sequencing data have also been taken into consideration
- aims at high-precision mapping of the transcription start site (+/- 5 bp)
- promoter sequences from 139 different species, still relatively low coverage (e.g. 1871 human entries)
- format of TSS annotation:
    DR EMBL ZZ999999.1 HS28BP -19, 9.
          -15       -10        -5         0         5
    a c c c g c c t g c a c c c g a t t c A T G T G A G A A
- one or several alternative transcription start sites per gene
4 EPD format
ID   HS_RPS3 standard multiple VRT.
XX
AC   EP74176
XX
DT   10-JAN-2003 (Rel. 73, created)
DT   13-SEP-2004 (Rel. 80, Last annotation update).
XX
DE   Ribosomal protein S3.
OS   Homo sapiens (human).
XX
HG   none.
AP   none.
NP   none.
XX
DR   GENOME NT_033927.7 NT_033927 -5333322, 12577805. ENSEMBL UCSC HapMap
DR   CLEANEX HS_RPS3.
DR   EMBL AP000744.4 -90138, 35862. EMBL GenBank DDBJ
DR   SWISS-PROT P23396 RS3_HUMAN.
DR   RefSeq NM_001005 DBTSS .
DR   MIM 600454.
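EPD entries are line-oriented: each record line starts with a two-letter code (ID, AC, DT, DE, OS, DR, ...) and XX lines act as spacers. A minimal Python sketch of grouping the lines of one entry by code is given here. The function name `parse_entry` and the shortened `SAMPLE` text are illustrative assumptions, not part of any EPD tool, and the sketch does not interpret the content of individual fields.

```python
from collections import defaultdict

def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("XX"):
            continue  # XX lines are spacers and carry no data
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)

# Shortened, hypothetical excerpt of the HS_RPS3 entry, for illustration only.
SAMPLE = """\
ID   HS_RPS3 standard multiple VRT.
XX
AC   EP74176
XX
DE   Ribosomal protein S3.
OS   Homo sapiens (human).
XX
DR   CLEANEX HS_RPS3.
DR   MIM 600454.
"""

entry = parse_entry(SAMPLE)
print(entry["AC"])
print(entry["DR"])
```

Repeated codes such as DR are kept as a list in file order, so cross-references can be filtered afterwards by their database name.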
5 TSS determined by modelling Gaussian distributions (MADAP)
[Figure: frequency of full-length transcripts versus genomic position; labels 10 bp and 45 bp; two regions marked R at 84047148-84047231 and 84046905-84046987]
The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Schmid, C.D., Praz, V., Delorenzi, M., Perier, R. and Bucher, P. (2004) Nucleic Acids Res, 32, D82-85.
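To illustrate the idea of modelling transcript 5' end positions with Gaussian distributions, here is a minimal sketch that fits a two-component one-dimensional Gaussian mixture by expectation-maximisation. It is not the MADAP implementation; the function name `fit_two_gaussians`, the initialisation, the standard-deviation floor, and the toy positions are all assumptions made for this example.

```python
import math

def fit_two_gaussians(positions, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns a list of (weight, mean, sd) tuples, one per component.
    """
    xs = sorted(positions)
    means = [float(xs[0]), float(xs[-1])]  # crude start: one component per extreme
    sds = [10.0, 10.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each position
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, sds)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and sd of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1.0)  # floor keeps a component from collapsing
    return list(zip(weights, means, sds))

# Toy 5' end positions (hypothetical): two clusters of transcript starts.
toy_positions = [96, 98, 99, 100, 100, 101, 102, 104,
                 335, 338, 340, 340, 341, 345]
components = fit_two_gaussians(toy_positions)
for weight, mean, sd in components:
    print(round(weight, 2), round(mean, 1), round(sd, 1))
```

In such a model each component mean would be read as a candidate transcription start site and the weight as the share of transcripts supporting it.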
6
Source             -10 to +10   -400 to +400   (unlabeled)
EPD 70             0.83         1              36
RefSeq mRNA        0.32         0.95           933
Genome annot.      0.31         0.95           890
DBTSSv1 (human)    0.13         0.68           933
Eponine            0.12         0.46           494
7 Superior precision of in silico primer extension (ISPE)
8 New data sources for EPD
- ChIP-chip: Kim et al. (2005) Nature, 436, 876-880
- GEO: GSE2672 (remapped!)
- virtual counts: (2 log ratio)-1
- ENSEMBL: chromosome 12, 6.8-6.94 Mb
9 ChIP-chip data with insufficient resolution
FP Hs USP5 R EUNC_000012.10 1 6831557 74339.
10 EPD webserver: http://www.epd.isb-sib.ch/
- find EPD entry(-ies) using gene symbols,...
- extraction of promoter sequences in user-defined ranges
- direct transfer to Signal Search Analysis (SSA)
- download of complete (reference!) promoter sets: http://www.epd.isb-sib.ch/seq_download.html
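Extracting a promoter sequence in a user-defined range amounts to cutting a window around the transcription start site. The sketch below is a hedged illustration in Python, not the web server's code: `extract_window`, its parameters, and the flanking bases of the toy `genome` are assumptions. The range is expressed as a number of bases upstream of the TSS plus a number of bases starting at the TSS, which reproduces the 19 lowercase and 9 uppercase bases of the annotation example "-19, 9".

```python
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def extract_window(seq, tss_index, upstream, downstream, strand="+"):
    """Return `upstream` bases before the TSS plus `downstream` bases starting at it.

    `tss_index` is the 0-based index of the TSS in `seq` (forward coordinates).
    For strand "-", the window is reverse-complemented so it reads 5' to 3'.
    """
    if strand == "+":
        start, end = tss_index - upstream, tss_index + downstream
    else:
        start, end = tss_index - downstream + 1, tss_index + upstream + 1
    if start < 0 or end > len(seq):
        raise ValueError("requested range runs off the sequence")
    window = seq[start:end]
    if strand == "-":
        window = window.translate(_COMPLEMENT)[::-1]
    return window

# Promoter segment from the TSS annotation example; flanks are made up.
promoter = "acccgcctgcacccgattcATGTGAGAA"
genome = "gggg" + promoter + "tttt"
plus = extract_window(genome, 23, 19, 9)

# Same site seen on the opposite strand of the reverse-complemented sequence.
rc = genome.translate(_COMPLEMENT)[::-1]
minus = extract_window(rc, len(rc) - 1 - 23, 19, 9, strand="-")
print(plus)
print(minus)
```

Both calls return the same 28-base segment, which is a quick consistency check for the strand handling.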
11 SSA: Signal Search Analysis
Giovanna Ambrosini, ISREC (Swiss Institute for Experimental Cancer Research)
- History: Signal Search Analysis is a method developed by P. Bucher in the early eighties (Bucher, P. and Bryan, B. (1984) Nucleic Acids Res, 12(1 Pt 1), 287-305)
- Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.
- Signal search analysis programs
  - CPR: generates a constraint profile for the neighborhood of a functional site
  - SList: generates lists of over- and under-represented motifs in particular regions relative to a functional site
  - OProf: generates a signal occurrence profile for a particular motif
  - PatOP: optimizes a weight matrix description of a locally over-represented sequence motif
- Recent events: adaptation of software to new environment, SSA web server, application to promoters and translational start sites
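As a rough illustration of what a signal occurrence profile is, the following Python sketch counts, for successive windows along a set of aligned sequences, the percentage of sequences with at least one motif match starting in that window. It is not the OProf program: the exact-string matching, the function name `occurrence_profile`, its parameters, and the toy sequences are assumptions made for this example.

```python
def occurrence_profile(seqs, motif, site_index, window, step):
    """Percentage of sequences with a motif match starting in each window.

    `seqs` are equal-length sequences aligned on a functional site located at
    `site_index`; window start positions are reported relative to that site.
    """
    length = len(seqs[0])
    profile = []
    for start in range(0, length - window + 1, step):
        # A match starting inside the window may extend past its 3' border.
        end = start + window + len(motif) - 1
        hits = sum(1 for s in seqs if motif in s[start:end])
        profile.append((start - site_index, 100.0 * hits / len(seqs)))
    return profile

# Toy aligned sequences (hypothetical), functional site at index 12.
toy_seqs = [
    "GCGC" + "TATAAA" + "GCGCGCGCGC",
    "GCGCG" + "TATAAA" + "GCGCGCGCG",
    "GCGC" + "TATAAA" + "GCGCGCGCGC",
    "GC" * 10,
]
profile = occurrence_profile(toy_seqs, "TATAAA", site_index=12, window=4, step=4)
for position, percent in profile:
    print(position, percent)
```

A peak in such a profile at a fixed distance from the site is what identifies a motif as locally over-represented.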
12 Locally Over-represented Sequence Motifs
13 Definition of a Locally Over-represented Sequence Motif
A motif which preferentially occurs at a characteristic distance (range) from a certain type of functional position.
Example: the TATA-box is a locally over-represented sequence motif of the -30 region of eukaryotic POL II transcription initiation sites.
- Components of the formal motif description
  - A weight matrix or consensus sequence defining the motif
  - A cut-off value determining which subsequence constitutes a motif match
  - A preferred region of occurrence defined by 5' and 3' borders relative to a functional site, e.g. a transcription initiation site
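The three components of the formal motif description (weight matrix, cut-off value, preferred region) can be combined in a short Python sketch. This is an illustration under stated assumptions rather than the SSA implementation: the function `best_match_in_region`, the +1/-1 toy matrix for TATA, and the toy sequence are hypothetical, and sequences are assumed to contain only uppercase A, C, G, T.

```python
def best_match_in_region(seq, site, matrix, cutoff, region):
    """Return (score, offset) of the best match in the region, or None.

    `matrix` is a list of per-column dicts mapping base -> weight, `region` is
    a (5' border, 3' border) pair of match-start offsets relative to `site`,
    and a subsequence counts as a match only if its score reaches `cutoff`.
    """
    five, three = region
    width = len(matrix)
    best = None
    for offset in range(five, three + 1):
        start = site + offset
        if start < 0 or start + width > len(seq):
            continue  # window would run off the sequence
        score = sum(col[base] for col, base in zip(matrix, seq[start:start + width]))
        if score >= cutoff and (best is None or score > best[0]):
            best = (score, offset)
    return best

# Toy weight matrix for TATA: +1 for the consensus base, -1 otherwise.
tata_matrix = [{b: (1 if b == c else -1) for b in "ACGT"} for c in "TATA"]

# Toy sequence: functional site at index 40, one TATA at -30 and one at +4.
toy_seq = "G" * 10 + "TATA" + "G" * 26 + "A" + "GGG" + "TATA" + "GG"
in_region = best_match_in_region(toy_seq, 40, tata_matrix, 4, (-35, -25))
elsewhere = best_match_in_region(toy_seq, 40, tata_matrix, 4, (-20, -10))
print(in_region, elsewhere)
```

The second TATA in the toy sequence scores just as well but lies outside the -35 to -25 region, so it does not count as an occurrence of the positioned motif.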
14 Locally Over-represented Sequence Motifs
- Input Data Structure
  - Primary experimental data (Functional Position Set): annotated functional positions in DNA sequences stored in a database
  - Work data (a DNA sequence matrix): a set of fixed-length sequence segments with an experimentally defined site at a fixed internal position
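The step from a functional position set to a DNA sequence matrix can be sketched as follows. This is a minimal Python illustration with assumed names (`build_sequence_matrix`, `toy_sequences`, `toy_positions`), not the data format used by the SSA programs; sites too close to a sequence end are simply skipped here.

```python
def build_sequence_matrix(sequences, positions, five_prime, three_prime):
    """Cut fixed-length segments around annotated functional positions.

    `sequences` maps sequence id -> sequence, `positions` is a list of
    (sequence id, 0-based index) pairs, and the segment spans offsets
    five_prime..three_prime (inclusive) relative to each site, so the site
    always lands in column -five_prime of the matrix.
    """
    matrix = []
    for seq_id, index in positions:
        seq = sequences[seq_id]
        start, end = index + five_prime, index + three_prime + 1
        if start < 0 or end > len(seq):
            continue  # segment would run off the sequence
        matrix.append(seq[start:end])
    return matrix

# Toy data (hypothetical): two sequences and three annotated sites.
toy_sequences = {"s1": "AAAACCCCGGGGTTTT", "s2": "TTTTGGGGCCCCAAAA"}
toy_positions = [("s1", 8), ("s2", 4), ("s2", 1)]
seq_matrix = build_sequence_matrix(toy_sequences, toy_positions, -3, 2)
print(seq_matrix)
```

Because every row has the site in the same column, column-wise statistics (base frequencies, motif counts per window) can be computed directly on the matrix.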
15 The Motif Search Problem
- For a given DNA sequence matrix,
- find a locally optimal combination of
  - quantitative motif description
  - cut-off value
  - region of preferential occurrence
- using a given quality criterion
16 TATA-box Signal Occurrence Profile for EPD and ENSEMBL Drosophila Promoters
17 CCAAT-box Signal Occurrence Profile for Vertebrate and ENSEMBL Drosophila Promoters
18 SSA webserver: http://www.isrec.isb-sib.ch/ssa
- Provides access to precompiled functional position sets
  - Collections of transcription initiation sites (promoters) from eukaryotic species
  - Collections of translation initiation sites from a large variety of prokaryotic genomes
- Provides access to the four signal search analysis programs
19 Application to a bacterial translational control signal: the Shine-Dalgarno ribosome binding-site motif
- Compare the strength and location of the Shine-Dalgarno mRNA-rRNA interaction motif in E. coli and B. subtilis in a qualitative manner.
- Result: the Shine-Dalgarno interaction motif is stronger in B. subtilis than in E. coli and centered about two bases further upstream in the former species. More than a hundred bacterial genomes are now available to perform this type of analysis.
20 Studying transcription regulatory processes with specialized bioinformatics resources: an example
- Biological question
  - Do genes that are generally up-regulated in cancer cells have different types of promoters?
- Procedure
  - Define cancer up- and down-regulated gene sets using CleanEx
  - Extract corresponding promoter regions from EPD
  - Analyse the signal content of the two promoter sequence sets using SSA
21 Comparative analysis of cancer up- and down-regulated promoters
- Signals considered
  Signal      Preferred position   Approx. frequency (%)
  Initiator   0                    25-50
  TATA-box    -30 to -25           30
  GC-box      -200 to 0            50
  CCAAT-box   -200 to -50          20
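Comparing the signal content of two promoter sets comes down to measuring, for each set, the fraction of promoters carrying a given signal in its preferred region. The Python sketch below illustrates this with exact matching of a consensus string. The function `fraction_with_signal`, the consensus TATAAA, and the two toy sets are assumptions for illustration only; they are not CleanEx or EPD data, and the SSA programs use weight matrices rather than exact strings.

```python
def fraction_with_signal(seqs, site, motif, region):
    """Fraction of sequences with a motif match starting inside the region.

    `region` is a (5' border, 3' border) pair of offsets relative to `site`,
    the 0-based index of the transcription start site in every sequence.
    """
    if not seqs:
        return 0.0
    five, three = region
    hits = 0
    for s in seqs:
        first = max(site + five, 0)
        last = min(site + three, len(s) - len(motif))
        if any(s.startswith(motif, i) for i in range(first, last + 1)):
            hits += 1
    return hits / len(seqs)

# Toy promoter sets (hypothetical), TSS at index 35 in each 40-base sequence.
up_set = [
    "G" * 7 + "TATAAA" + "G" * 27,   # match at -28
    "G" * 6 + "TATAAA" + "G" * 28,   # match at -29
    "G" * 40,                        # no match
]
down_set = [
    "G" * 40,                        # no match
    "G" * 20 + "TATAAA" + "G" * 14,  # match at -15, outside the region
    "G" * 5 + "TATAAA" + "G" * 29,   # match at -30
]
tata_region = (-30, -25)
up_fraction = fraction_with_signal(up_set, 35, "TATAAA", tata_region)
down_fraction = fraction_with_signal(down_set, 35, "TATAAA", tata_region)
print(up_fraction, down_fraction)
```

With real gene sets, the two fractions would still need a statistical test before any difference is interpreted.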
22 Positional distribution of Initiator motif in cancer up- and down-regulated promoters
23 Positional distribution of TATA-boxes in cancer up- and down-regulated promoters
24 Positional distribution of GC-boxes in cancer up- and down-regulated promoters
25 Positional distribution of CCAAT-boxes in cancer up- and down-regulated promoters
26 Comparative analysis of cancer up- and down-regulated promoters: Summary of results
- Signal content
  Signal      Frequency in cancer-up genes   Frequency in cancer-down genes
  Initiator   no change                      no change
  TATA-box    up                             down
  GC-box      no change                      no change
  CCAAT-box   up                             down
- Next questions
  - Are TATA-box and CCAAT-box binding factors dysregulated in cancer cells?
  - Or do cancer-specific transcription factors (binding to adjacent sites) preferentially interact with TATA-box and CCAAT-box binding factors?
27 Concluding remarks
- Signal search analysis has played an instrumental role in the characterization of eukaryotic promoter elements
- The method was originally developed for the analysis of eukaryotic promoters but has a much broader application potential (e.g. Shine-Dalgarno signal analysis)
- The rapidly growing collection of complete genomes and high-throughput methods for genomic analysis increase the statistical power to discover new motifs, or to better characterize already known control signals
- Aligning sequence sets with respect to a well characterized motif might allow the detection of binding sites of cooperating transcription factors positionally correlated with the known motif
- Confirm or challenge commonly accepted hypotheses originally derived from small sets