Title: The complexity of the transcriptional landscape of the human genome, Roderic Guigó
1 The complexity of the transcriptional landscape of the human genome
Roderic Guigó
Center for Genomic Regulation, Barcelona
2 Genes and proteins
- One gene, one enzyme (Beadle and Tatum)
- The Central Dogma (Francis Crick)
3 The standard model of the eukaryotic gene
Most of the transcriptional output of the human genome is localized in
well-defined genomic loci, which encode mRNAs that, when exported into the
cytosol, are translated into proteins.
6
- 1% of the genome, 44 regions
- target selection: a committee selected the sequence targets
- manual targets: regions with a lot of information
- random targets: stratified by non-exonic conservation with mouse and by gene density
7
DNase Hypersensitive Sites
DNA Replication
Epigenetic ?
Genes and Transcripts
Cis-regulatory elements (promoters, transcription factor binding sites)
Long-range regulatory elements (enhancers, repressors/silencers, insulators)
9 GENCODE: encyclopedia of genes and gene variants
- identify all protein coding genes in the ENCODE regions
- identify one complete mRNA sequence for at least one splice isoform of each protein coding gene
- eventually, identify a number of additional alternative splice forms
- Roderic Guigó, IMIM-UPF-CRG
- Stylianos Antonarakis, Alexandre Reymond, Geneva
- Ewan Birney, EBI
- Michael Brent, WashU
- Lior Pachter, Berkeley
- Manolis Dermitzakis, Sanger
- Jennifer Ashurst, Tim Hubbard
10 The GENCODE pipeline
- mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the human genome
- manual curation to resolve conflicting evidence
- additional computational predictions
- experimental verification
- FINAL ANNOTATION
12 The GENCODE pipeline
- manual curation: HAVANA (Sanger)
- experimental verification: Geneva
- bioinformatics: IMIM
- 2608 transcripts in 487 loci
- 137 transcripts in 53 non-coding loci
- 1097 coding transcripts and 1374 non-coding transcripts in 434 protein coding loci
- most protein coding loci encode a mixture of protein coding and non-coding transcripts
13 One gene, many proteins: very complex transcription units
14 Distribution of DNaseI HSs vs. TSS in Different Gene Annotation Sets
(from the ENCODE Chromatin and Replication Group, John Stamatoyannopoulos)
15 EGASP05
- The complete annotation of 13 regions was released on January 30, 2005.
- The annotation of the remaining 31 regions was still being produced, and it was withheld.
- Gene prediction groups were asked to submit predictions for the remaining 31 regions by April 15, 2005.
- 18 groups participated, submitting 30 prediction sets.
- Predictions were compared to the annotations in an NHGRI-sponsored workshop at the Wellcome Trust Sanger Institute on May 6 and 7, 2005.
16 EGASP05
- two main goals
- to assess how well automatic methods are able to reproduce the (costly) manual/computational/experimental GENCODE annotation
- to assess how complete the GENCODE annotation is: are there still genes consistently predicted by computational methods?
17 Accuracy measures
18 Accuracy at the coding exon level
evidence-based, dual genome, ab initio
19 Programs are quite good at calling the protein coding exons (accuracy at
80%), and not as good at calling the transcribed exons, but the best of the
programs correctly predicts only 40% of the complete transcripts (considering
only the coding fraction).
EGASP05
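As a rough illustration of what an exon-level accuracy figure measures, the sketch below uses the convention common in gene-prediction evaluations: a predicted exon counts as correct only if both of its boundaries match an annotated exon, sensitivity is the fraction of annotated exons recovered, and "specificity" is the fraction of predicted exons that are correct. The function name, the exact-match criterion and the toy coordinates are assumptions for illustration; the slides do not spell out how the EGASP05 measures were defined.

```python
def exon_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (start, end) tuples. A prediction is counted as correct
    only if both coordinates match an annotated exon exactly
    (an assumed criterion, see the note above).
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = len(annotated & predicted)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical toy coordinates, not taken from the ENCODE regions.
annotated = [(100, 200), (300, 400), (500, 650), (800, 900), (1000, 1100)]
predicted = [(100, 200), (300, 400), (500, 650), (800, 900), (1200, 1300)]

sn, sp = exon_accuracy(annotated, predicted)
print(f"exon sensitivity = {sn:.2f}, exon specificity = {sp:.2f}")
```

Under this criterion a program can score well on exons while still missing most complete transcripts, because a transcript is only correct when every one of its exons is.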
20 Many novel exons predicted
- 8,634 unique exons predicted in intergenic regions
- we ranked the exons according to the accuracy of the predicting programs
- tested 238 exon pairs by RT-PCR in 24 tissues
- only 7 (less than 3%) were confirmed positive
EGASP05
23 Genome tiling arrays
Slide from http://signal.salk.edu/msample.html (Salk Institute Genomic Analysis Laboratory)
24 TRANSCRIPTION OF PROCESSED POLY A RNA, based on a number of high-throughput technologies
Total number of nucleotides: 29,998,060; non-repeat-masked: 14,707,189

                                  Nucleotides covered    % nucleotides covered
Annotated exons                   1,650,821              9.8
transfrag/TAR                     1,278,588              9.3
CAGE tags                         151,149                0.5
Ditags                            24,939                 0.1
TOTAL UNIQUE transcribed bases    2,355,238              14.7
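The total of unique transcribed bases (2,355,238) is smaller than the sum of the four rows above it (3,105,497), which is what one expects when the evidence types overlap on the genome. A minimal sketch of how such a non-redundant total can be computed is to merge overlapping intervals before counting. The function name, the half-open coordinate convention and the toy intervals are assumptions for illustration, not the procedure actually used for the table.

```python
def unique_covered_bases(intervals):
    """Count bases covered by at least one half-open interval [start, end)."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running block: close it and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical features on one strand of one region.
exons = [(0, 100), (250, 300)]
transfrags = [(50, 150), (400, 450)]
cage_tags = [(90, 110)]

summed = sum(end - start for start, end in exons + transfrags + cage_tags)
unique = unique_covered_bases(exons + transfrags + cage_tags)
print(f"sum of features = {summed} bp, unique coverage = {unique} bp")
```

In the toy data the features sum to 320 bp but cover only 250 distinct bases, the same kind of gap seen between the per-row counts and the unique total on the slide.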
25 Table 1. Summary of Transcriptional Coverage of ENCODE Regions.
Entries are bases, with percentages in parentheses. Rows 3 to 8 refer to
PROCESSED TRANSCRIPTS (PT); rows 9 to 12 refer to PRIMARY TRANSCRIPTS.

                                      TOTAL (interrogated
                                      and uninterrogated)    INTERROGATED
 1  Total bases                       29998060               29998060
 2  Total interrogated bases (%)      14707189 (49.)         14707189 (49.)
PROCESSED TRANSCRIPTS (PT)
 3  bp in exons (%)                   1776157 (5.9)          1447192 (9.8)
 4  bp in CAGE tags (%)               151149 (0.5)           116013 (0.8)
 5  bp in PET (%)                     24939 (0.1)            19629 (0.1)
 6  bp in TF (%)                      1369611 (4.6)          1369304 (9.3)
 7  Total bases in PT (%)             2519280 (8.4)          2163303 (14.7)
 8  Bases in PT, ESTs included (%)    4826292 (16.1)         3545358 (24.1)
PRIMARY TRANSCRIPTS
 9  Bases in exons and introns (%)    17758738 (59.2)        9496360 (64.6)
10  Bases with 5' RACE (%)            23318182 (77.7)        11763410 (80.0)
11  Bases between PETs (%)            19658563 (65.5)        9767311 (66.4)
12  Total bases (%)                   27325931 (91.1)        13618240 (92.6)
27 Other recent studies
- Many individual studies suggest unanticipated complexity of the transcriptional map of the human genome
- Kapranov et al. (2007): RNA onto tiling arrays, novel RNA classes, hundreds of thousands of novel sites of transcription
- Peters et al. (2007): LongSAGE, evidence for thousands of novel transcripts
- Roma et al. (2007): gene trap sequence tags in mouse embryonic stem cells, thousands of novel transcripts
- Unneberg and Claverie (2007): interchromosomal transcript chimerism
- Denoeud et al. (2007): RACEarrays. Doubling the number of annotated exons in protein coding transcripts, widespread transcript chimerism
28 Tiling arrays reveal many novel sites of transcription
TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME COURSE (data by Tom Gingeras, Affymetrix)
29 Characteristics of unannotated transfrags
- short: 78 bp on average, compared with 121 bp for exonic transfrags
- very GC-rich: 56% vs 42% in the background of unannotated regions
- lack splice sites
- no matches to protein or domain databases
- lack of selective constraints
- HOWEVER
- reproducible across cell lines
- supported by independent evidence of transcription (mostly unspliced ESTs)
- enriched for RNA structures
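The first two characteristics in the list above (average length and GC content) are simple summary statistics over a set of sequences. A minimal sketch of how they could be computed follows; the function names and the toy sequences are hypothetical, and the slide does not say how ambiguous bases were handled, so ignoring everything except A, C, G and T is an assumption.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def mean_length(seqs):
    """Average sequence length in bases."""
    return sum(len(s) for s in seqs) / len(seqs)


# Hypothetical toy sequences standing in for transfrags.
toy_transfrags = ["GCGCGCAT", "GGCCAATT", "GCGC"]

pooled_gc = gc_fraction("".join(toy_transfrags))
print(f"mean length = {mean_length(toy_transfrags):.1f} bp, pooled GC = {pooled_gc:.0%}")
```

Pooling the sequences before computing GC weights each base equally; averaging per-sequence GC fractions would instead weight each transfrag equally, and the two can differ when lengths vary.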
30 Other recent studies
- Many individual studies suggest unanticipated complexity of the transcriptional map of the human genome
- Kapranov et al. (in press): RNA onto tiling arrays, novel RNA classes, hundreds of thousands of novel sites of transcription
- Peters et al. (2007): LongSAGE, evidence for thousands of novel transcripts
- Roma et al. (in press): gene trap sequence tags in mouse embryonic stem cells, thousands of novel transcripts
- Unneberg and Claverie (2007): interchromosomal transcript chimerism
- Denoeud et al. (in press): RACEarrays. Doubling the number of annotated exons in protein coding transcripts, widespread transcript chimerism
31 Rozowsky et al., 2007
- Novel TARs/transfrags are associated with known genes by identifying novel TARs that are co-expressed with known genes across 11 cell lines and conditions
32 Rozowsky et al., 2007
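One simple way to picture the co-expression idea is to correlate the expression profile of a novel TAR with the profiles of known genes across the same samples and assign the TAR to the best-correlated gene. The sketch below does this with a plain Pearson correlation. The use of Pearson correlation, the function names and the toy profiles (11 values each, one per cell line or condition) are assumptions for illustration; the slide does not state which statistic Rozowsky et al. used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def best_coexpressed_gene(tar_profile, gene_profiles):
    """Name of the known gene whose profile correlates best with the TAR."""
    return max(gene_profiles, key=lambda g: pearson(tar_profile, gene_profiles[g]))


# Hypothetical expression values across 11 cell lines and conditions.
tar = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
genes = {
    "geneA": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22],   # rises with the TAR
    "geneB": [22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2],   # falls as the TAR rises
    "geneC": [5, 1, 4, 2, 6, 3, 7, 2, 5, 4, 6],          # weakly related
}

print(best_coexpressed_gene(tar, genes))
```

A real analysis would also need a significance threshold and some restriction to genes near the TAR; both are left out of this sketch.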
33 Denoeud et al., 2007
- The ENCODE experiments
- 5' RACE on 12 tissues
- primers in internal exons of 399 protein coding loci
- RACE products hybridized onto genome tiling arrays
- 4573 RACE exons detected, 2324 novel
34 5' RACE from the TMEM15 gene (region ENr232) identifies several
tissue-specific distal 5' exons.
Target gene
36 More than 30% of RACEfrags are more than 3Mb away from the index exon
37 Distal RACEfrags are associated with independently predicted sites of
transcription initiation
38 Cloning and sequencing of RACEarray products
39 Cloning and sequencing of RACEarray products
Almost 30% of the sequenced products incorporate exons from upstream genes
in chimeric structures
40 RACEarrays: a strategy for normalization of RACE libraries and exhaustive
identification of alternative transcripts
41 Array-based normalization of RACE libraries
If we select 40 clones at random from the RACE reaction, the probability of
selecting a clone from the less abundant form is 0.01 (assuming a multinomial
distribution). However, if the transcript forms could be segregated by RT-PCR,
then by again selecting 40 random clones, 10 from each RT-PCR, the probability
of selecting the less abundant form is now 0.6.
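The reasoning behind these two probabilities can be sketched with the chance of picking at least one clone of a given form among n random clones, which is 1 - (1 - f)^n when that form makes up a fraction f of the clones (for a single form of interest the multinomial model reduces to this binomial expression). The slide does not give the underlying abundances, so the code below works backwards from the quoted probabilities; reading 0.6 as the chance of seeing the rare form among the 10 clones picked from its own RT-PCR is an assumption, as are the function names.

```python
def p_at_least_one(freq, n_clones):
    """Probability that n random clones include at least one of a form
    present at relative frequency freq."""
    return 1.0 - (1.0 - freq) ** n_clones


def implied_frequency(prob, n_clones):
    """Relative frequency that would give probability prob of seeing
    at least one clone of the form among n_clones (inverse of the above)."""
    return 1.0 - (1.0 - prob) ** (1.0 / n_clones)


# 40 clones from the unsegregated RACE reaction, probability 0.01.
f_pool = implied_frequency(0.01, 40)
# 10 clones from the RT-PCR containing the rare form, probability 0.6.
f_segregated = implied_frequency(0.6, 10)

print(f"implied frequency in the pooled reaction: {f_pool:.5f}")
print(f"implied frequency after segregation:      {f_segregated:.4f}")
```

Under these assumptions the rare form would be a few hundred times more frequent within its own RT-PCR product than in the pooled reaction, which is why far fewer clones are needed to find it.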
42 RT-PCR cloning and sequencing pilot (Kourosh Salehi-Ashtiani, DFCI)
- 24 novel RACEfrags tested by RT-PCR, including 6 cases previously confirmed in Denoeud et al. (2007)
- Positive RT-PCR products cloned, and 32 randomly selected clones sequenced.
- RESULTS
- 14 positive RT-PCRs, 13 confirmed by sequencing.
- 42 novel transcript variants, compared with the 52 previously known for the RT-PCR positive loci.
- Nearly all splice boundaries are canonical
- Genomic extensions from 2.5 to 145 Kb
Difficulties in obtaining sequences for long cDNAs (which correlate with long
genomic extensions), even for previously verified cases. Problems with
RACEfrag assignment to loci.
43 A very efficient strategy for targeted large-scale transcript discovery
- RACEarray normalization: 448 attempted clone sequences → 42 novel transcripts
- 1 novel transcript per 10 clones sequenced.
- Carninci et al. (Genome Research 2003, 13:1273-1289): 1,989,385 ESTs → 70,214 transcripts (mouse)
- 1 transcript after 30 sequenced ESTs (and the majority of transcripts already known)
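The two yields quoted above follow directly from the counts on the slide, as the short calculation below shows; the variable and function names are illustrative only. Note that the comparison is conservative for RACEarrays, since the RACEarray figure counts only novel transcripts while the EST figure counts all transcripts.

```python
def sequences_per_transcript(n_sequenced, n_transcripts):
    """Average number of sequenced clones or ESTs per transcript found."""
    return n_sequenced / n_transcripts


# Counts taken from the slide.
racearray = sequences_per_transcript(448, 42)
est_project = sequences_per_transcript(1_989_385, 70_214)

print(f"RACEarray: one novel transcript per {racearray:.1f} clones")
print(f"ESTs:      one transcript per {est_project:.1f} ESTs")
```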
44 CONCLUSIONS
- there is a substantial amount of transcription which does not appear to be associated with protein coding loci
- only a fraction of the transcript diversity of protein coding loci appears to have been surveyed so far.
- in particular, protein coding loci appear to have tissue-specific distal alternative transcriptional start sites
- RACEarrays are an effective normalization strategy for identification of rare transcripts
- ENCODE transcriptional landscape: a network of overlapping coding and non-coding transcripts, resulting in a continuum of transcription (more than 90% of the ENCODE regions are transcribed in at least one strand)
45 PROVING THE FUNCTIONALITY OF NOVEL TRANSCRIPTS
46 The GENCODE annotation
- 487 loci. 2608 transcripts
- 53 non-coding loci. 137 transcripts
- 434 protein coding loci.
- 1097 coding transcripts
- 1374 non-coding transcripts
- 5.7 transcripts per protein coding locus
- 2.5 coding transcripts per locus
- 1.7 proteins per locus
47 The combined analysis of BioSapiens, Kellis and Goldman identified 184
annotated protein coding transcripts which challenged (from the structural,
functional and/or evolutionary standpoint) our current view of proteins.
Footnote: removing these 184 proteins from the set of 738 GENCODE proteins
will leave 554 proteins for 434 loci, barely 1.3 proteins per locus.
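The per-locus figures on the GENCODE annotation slide and in the footnote above can be reproduced from the raw counts. The short check below uses only numbers quoted in the slides; the variable names are illustrative.

```python
# Counts quoted in the slides.
total_loci, total_transcripts = 487, 2608
noncoding_loci, noncoding_locus_transcripts = 53, 137
coding_loci = 434
coding_transcripts, noncoding_transcripts = 1097, 1374
proteins, challenged_proteins = 738, 184

# Internal consistency of the annotation totals.
loci_add_up = noncoding_loci + coding_loci == total_loci
transcripts_add_up = (
    noncoding_locus_transcripts + coding_transcripts + noncoding_transcripts
    == total_transcripts
)

transcripts_per_locus = (coding_transcripts + noncoding_transcripts) / coding_loci
coding_per_locus = coding_transcripts / coding_loci
proteins_per_locus = proteins / coding_loci
remaining_proteins = proteins - challenged_proteins
remaining_per_locus = remaining_proteins / coding_loci

print(f"{transcripts_per_locus:.1f} transcripts per protein coding locus")
print(f"{coding_per_locus:.1f} coding transcripts per locus")
print(f"{proteins_per_locus:.1f} proteins per locus")
print(f"{remaining_proteins} proteins, {remaining_per_locus:.1f} per locus after removal")
```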
48 Locus RP11-298J23.1 codes for pepsinogen C. The
structure of pepsinogen C is 1htrA. Isoform -003
is missing 80 residues with respect to pepsinogen
C. Here the missing section of -003 is in light
green. The missing section in this isoform would
remove the core from both subdomains of the
structure. Both the N-terminal sub-domain (on the
left) and the C-terminal sub-domain would have to
refold. This is the view from above looking down
into the active cleft of the proteinase. Active
site aspartates are shown in ball and stick. One
of the two active site residues is in the missing
section. The symmetry apparent in this isoform
suggests that although it will have to refold it
may very well be able to reform into a single
subdomain.
Michael Tress, Alfonso Valencia (CNB, Madrid)
49 Expression levels: alternative vs constitutive
- Q-PCR in three cell lines
- SKNAS
- GM06990
- HelaS3
50 Polysomal association: alternative vs constitutive
51 ACKNOWLEDGEMENTS
ENCODE GT GROUP
Stylianos Antonarakis (Geneva), Robert Baertsch (UCSC), Ian Bell (Affx),
Ewan Birney (EBI), Robert Castelo (IMIM), Jill Cheng (Affx),
Evelyn Cheung (Affx), Hiram Clawson (UCSC), France Denoeud (IMIM),
Sam Deutsch (Geneva), Sujit Dike (Affymetrix), Jorg Drenkow (Affymetrix),
Olof Emanuelsson (Yale), Paul Flicek (Sanger), Mark Gerstein (Yale),
Srinka Ghosh (Affx), Jenn Harrow (Sanger), Greg Helt (Affx),
Ivo Hofacker (U. Vienna), Tim Hubbard (Sanger), Phil Kapranov (Affx),
Damian Keefe (EBI), Jan Korbel (Yale), Julien Lagarde (IMIM),
Jeff Long (Affx), Todd Lowe (UCSC), G. Madhavan (Affx),
Anton Nekrutenko (Penn State), David Nix (Affx), Jakob Pedersen (UCSC),
Alex Reymond (Geneva), Joel Rozowsky (Yale), Yijun Ruan (GIS),
Albin Sandelin (RIKEN), Mike Snyder (Yale), Peter F. Stadler (U. Vienna),
Kevin Struhl (Harvard), Hari Tammana (Affx), Scott Tenenbaum (SUNY, Albany),
Chia Lin Wei (GIS), Matt Weirauch (UCSC), Deyou Zheng (Yale)
Adam Frankish (Sanger), Tom Gingeras (Affymetrix), Roderic Guigó (CRG)
53 CENTER FOR GENOMIC REGULATION, PRBB, BARCELONA