The complexity of the transcriptional landscape of the human genome - Roderic Guigó

Description:

The complexity of the transcriptional landscape of ... Beadle and Tatum. The Central Dogma. Francis Crick. 8/14/09 ... The standard model of the eukaryotic gene ...

Slides: 53

Provided by: rode89

Learn more at: http://incob.apbionet.org

Transcript and Presenter's Notes

1
The complexity of the transcriptional landscape
of the human genome
Roderic Guigó
Center for Genomic Regulation, Barcelona
2
genes and proteins

One gene, one enzyme: Beadle and Tatum

The Central Dogma: Francis Crick

3
The standard model of the eukaryotic gene
most of the transcriptional output of the human
genome is localized in well defined genomic loci,
which encode mRNAs that, when exported into the
cytosol, are translated into proteins
4
(No Transcript)
5
(No Transcript)
6

1% of the genome. 44 regions
target selection: committee to select sequence
targets
manual targets: a lot of information
random targets: stratified by non-exonic
conservation with mouse and gene density

7
DNase Hypersensitive Sites
DNA Replication
Epigenetic ?
Genes and Transcripts
Cis-regulatory elements (promoters, transcription
factor binding sites)
Long-range regulatory elements (enhancers,
repressors/silencers, insulators)
8
(No Transcript)
9
GENCODE: encyclopedia of genes and gene variants

identify all protein coding genes in the ENCODE
regions
identify one complete mRNA sequence for at least
one splice isoform of each protein coding gene.
eventually, identify a number of additional
alternative splice forms.

Roderic Guigó, IMIM-UPF-CRG
Stylianos Antonarakis, Alexandre Reymond, Geneva
Ewan Birney, EBI
Michael Brent, WashU
Lior Pachter, Berkeley
Manolis Dermitzakis, Sanger
Jennifer Ashurst, Tim Hubbard

10
the gencode pipeline

mapping of known transcript sequences (ESTs,
cDNAs, proteins) onto the human genome
manual curation to resolve conflicting evidence
additional computational predictions
experimental verification
FINAL ANNOTATION

11
the gencode pipeline

mapping of known transcript sequences (ESTs,
cDNAs, proteins) onto the human genome
manual curation to resolve conflicting evidence
additional computational predictions
experimental verification
FINAL ANNOTATION

12
The gencode pipeline
manual curation: HAVANA (Sanger)
experimental verification: Geneva
bioinformatics: IMIM

2608 transcripts in 487 loci
137 transcripts in 53 non-coding loci
1097 coding transcripts and 1374 non-coding
transcripts in 434 protein coding loci
most protein coding loci encode
a mixture of protein coding and
non-coding transcripts

13
one gene - many proteins
very complex transcription units
14
Distribution of DNaseI HSs vs. TSS in Different
Gene Annotation Sets
from the ENCODE Chromatin and Replication Group,
John Stamatoyannopoulos
15
EGASP05

the complete annotation of 13 regions was
released on January 30, 2005.
The annotation of the remaining 31 regions was
still being obtained, and it was withheld.
gene prediction groups were asked to submit
predictions by April 15, 2005 in the remaining 31
regions.
18 groups participated, submitting 30 prediction
sets
predictions were compared to the annotations in an
NHGRI sponsored workshop at the Wellcome Trust
Sanger Institute, on May 6 and 7, 2005.

16
EGASP05

two main goals
to assess how well automatic methods are able to
reproduce the (costly) manual/computational/experimental
gencode annotation
how complete is the gencode annotation? are there
still unannotated genes consistently predicted by
computational methods?

17
accuracy measures
18
accuracy at the coding exon level
evidence-based, dual genome, ab initio
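
The transcript does not preserve the accuracy measures shown on the slides. A common convention in gene prediction evaluation is exon-level sensitivity (fraction of annotated exons predicted exactly) and specificity (fraction of predicted exons that exactly match an annotated exon). The sketch below assumes that convention; the function name and the coordinates are hypothetical, not taken from the EGASP05 data.

```python
from typing import Set, Tuple

# An exon is identified by (chromosome, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_level_accuracy(annotated: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    An exon counts as correct only if both boundaries and the strand
    match an annotated exon exactly (assumed convention).
    """
    correct = annotated & predicted
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical toy data: four annotated exons, three predicted exons.
annotated = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 650, "+"),
    ("chr1", 800, 900, "+"),
}
predicted = {
    ("chr1", 100, 200, "+"),  # exact match
    ("chr1", 300, 400, "+"),  # exact match
    ("chr1", 510, 650, "+"),  # wrong start, so not counted as correct
}

sn, sp = exon_level_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Requiring exact boundaries is the strict reading; a looser overlap-based criterion would give higher values on the same predictions.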
19
programs are quite good at calling the protein
coding exons (accuracy at 80%; not as good at
calling the transcribed exons), but the best of
the programs correctly predict only 40% of the
complete transcripts (considering only the coding
fraction)
EGASP05
20
many novel exons predicted
- 8,634 unique exons predicted in intergenic regions
- we ranked the exons according to the accuracy of
the predicting programs
- tested 238 exon pairs by RT-PCR in 24 tissues
- only 7 (less than 3%) were confirmed positive
EGASP05
21
(No Transcript)
22
DNase Hypersensitive Sites
DNA Replication
Epigenetic ?
Genes and Transcripts
Cis-regulatory elements (promoters, transcription
factor binding sites)
Long-range regulatory elements (enhancers,
repressors/silencers, insulators)
23
Genome tiling arrays
Slide from http://signal.salk.edu/msample.html,
Salk Institute Genomic Analysis Laboratory
24
TRANSCRIPTION OF PROCESSED POLY A RNA based on
a number of high throughput technologies
Total number of nucleotides: 29,998,060; non repeat masked: 14,707,189

                                 Nb of nucleotides covered   % nucleotides covered
Annotated exons                  1,650,821                   9.8
transfrag/tar                    1,278,588                   9.3
CAGE Tags                        151,149                     0.5
Ditags                           24,939                      0.1
TOTAL UNIQUE Transcribed Bases   2,355,238                   14.7
25
Table 1. Summary of Transcriptional Coverage of
ENCODE Regions.
Values are bases, with percentages in parentheses.

Column                             TOTAL (interrogated     INTERROGATED
                                   and uninterrogated)
 1 Total bases                     29998060                29998060
 2 Total interrogated bases        14707189 (49)           14707189 (49)
PROCESSED TRANSCRIPTS (PT)
 3 bp in exons                     1776157 (5.9)           1447192 (9.8)
 4 bp in CAGE tags                 151149 (0.5)            116013 (0.8)
 5 bp in PET                       24939 (0.1)             19629 (0.1)
 6 bp in TF                        1369611 (4.6)           1369304 (9.3)
 7 Total bases in PT               2519280 (8.4)           2163303 (14.7)
 8 Bases in PT (ESTs included)     4826292 (16.1)          3545358 (24.1)
PRIMARY TRANSCRIPTS
 9 Bases in exons and introns      17758738 (59.2)         9496360 (64.6)
10 Bases with 5' RACE              23318182 (77.7)         11763410 (80.0)
11 Bases between PETs              19658563 (65.5)         9767311 (66.4)
12 Total bases                     27325931 (91.1)         13618240 (92.6)
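
The percentages in Table 1 can be reproduced from the base counts: the TOTAL row appears to be expressed relative to all 29,998,060 bases and the INTERROGATED row relative to the 14,707,189 interrogated bases. The sketch below checks three of the columns under that reading; the function name and the short row labels are illustrative.

```python
# Denominators taken from Table 1: all bases in the ENCODE regions and
# the subset that was interrogated.
TOTAL_BASES = 29_998_060
INTERROGATED_BASES = 14_707_189


def coverage_percent(covered_bases: int, denominator: int) -> float:
    """Return covered bases as a percentage of the denominator, to one decimal."""
    return round(100.0 * covered_bases / denominator, 1)


# (label, bases in TOTAL row, bases in INTERROGATED row), copied from Table 1.
rows = [
    ("bp in exons", 1_776_157, 1_447_192),
    ("Total bases in PT", 2_519_280, 2_163_303),
    ("Total bases, primary transcripts", 27_325_931, 13_618_240),
]

for label, total, interrogated in rows:
    print(
        label,
        coverage_percent(total, TOTAL_BASES),
        coverage_percent(interrogated, INTERROGATED_BASES),
    )
```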
26
Table 1 Summary of Transcriptional Coverage of
ENCODE Regions.
27
Other recent studies

Many individual studies suggest unanticipated
complexity of the transcriptional map of the
human genome
Kapranov et al. (2007): RNA onto tiling arrays,
novel RNA classes, hundreds of thousands of novel
sites of transcription
Peters et al. (2007): LongSAGE, evidence for
thousands of novel transcripts
Roma et al. (2007): gene trap sequence tags in
mouse embryonic stem cells, thousands of novel
transcripts
Unneberg and Claverie (2007): interchromosomal
transcript chimerism
Denoeud et al. (2007): RACEarrays. Doubling the
number of annotated exons in protein coding
transcripts, widespread transcript chimerism

28
tiling arrays reveal many novel sites of
transcription
TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME
COURSE (data by Tom Gingeras, Affymetrix)
29
characteristics of unannotated transfrags

short: 78 bp on average, compared with 121 bp for
exonic transfrags
very GC-rich: 56% vs 42% in the background of
unannotated regions
lack splice sites
no matches to protein or domain databases
lack of selective constraints
HOWEVER
reproducible across cell lines
supported by independent evidence of transcription
(mostly unspliced ESTs).
enriched for RNA structures.

30
Other recent studies

Many individual studies suggest unanticipated
complexity of the transcriptional map of the
human genome
Kapranov et al. (in press): RNA onto tiling
arrays, novel RNA classes, hundreds of thousands
of novel sites of transcription
Peters et al. (2007): LongSAGE, evidence for
thousands of novel transcripts
Roma et al. (in press): gene trap sequence tags in
mouse embryonic stem cells, thousands of novel
transcripts
Unneberg and Claverie (2007): interchromosomal
transcript chimerism
Denoeud et al. (in press): RACEarrays. Doubling
the number of annotated exons in protein coding
transcripts, widespread transcript chimerism

31
Rozowsky et al., 2007

Novel TARs/transfrags are associated with known
genes by identifying novel TARs that are
co-expressed with known genes across 11 cell
lines and conditions
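
The slide does not say which co-expression statistic was used. As a rough illustration of the idea, the sketch below assumes Pearson correlation between the expression profile of a novel TAR and that of a known gene across 11 samples. All values, the variable names and the 0.9 cutoff are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical expression levels across 11 cell lines and conditions.
known_gene = [2.0, 3.5, 1.0, 4.2, 5.1, 0.8, 2.9, 3.3, 4.8, 1.5, 2.2]
novel_tar = [1.8, 3.1, 1.2, 4.0, 4.7, 1.0, 2.5, 3.0, 4.5, 1.4, 2.0]

THRESHOLD = 0.9  # hypothetical cutoff for calling two profiles co-expressed

r = pearson(known_gene, novel_tar)
print(f"r={r:.3f} co-expressed={r >= THRESHOLD}")
```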

32
Rozowsky et al., 2007
33
Denoeud et al., 2007

The ENCODE experiments
5' RACE on 12 tissues
primers in internal exons of 399 protein coding
loci
RACE products hybridized onto genome tiling
arrays
4573 RACE exons detected, 2324 novel

34
5' RACE from the TMEM15 gene (region ENr232)
identifies several tissue-specific distal 5'
exons.
Target gene
35
(No Transcript)
36
more than 30% of RACEfrags more than 3Mb away
from the index exon
37
distal RACEfrags are associated with independently
predicted sites of transcription initiation
38
cloning and sequencing of RACEarray products
39
cloning and sequencing of RACEarray products
almost 30% of the sequenced products incorporate
exons from upstream genes in chimeric structures
40
RACEarrays: a strategy for normalization of RACE
libraries, and exhaustive identification of
alternative transcripts
41
Array based normalization of RACE libraries
If we select 40 clones at random from the RACE
reaction, the probability of selecting a clone
from the less abundant form is 0.01 (assuming a
multinomial distribution). However, if the
transcript forms could be segregated by RT-PCR,
then by again selecting 40 random clones, 10 from
each RT-PCR, the probability of selecting the
less abundant form is now 0.6
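
The transcript does not preserve the relative abundances behind the 0.01 and 0.6 figures. Under a multinomial model the number of clones drawn from any single form is binomial, so the probability of seeing a form at least once among n clones is 1 - (1 - f)^n, where f is its relative frequency in the pool being sampled. The sketch below uses that relation and, as an illustration only, back-solves the frequencies that the two quoted probabilities would imply; the function names are hypothetical.

```python
def prob_at_least_one(frequency: float, n_clones: int) -> float:
    """Probability that at least one of n_clones random clones is the given form."""
    return 1.0 - (1.0 - frequency) ** n_clones


def implied_frequency(probability: float, n_clones: int) -> float:
    """Relative frequency that yields the given detection probability with n_clones."""
    return 1.0 - (1.0 - probability) ** (1.0 / n_clones)


# Pooled RACE reaction: 40 clones, detection probability 0.01 (from the slide).
f_pooled = implied_frequency(0.01, 40)

# After segregation by RT-PCR: 10 clones from the relevant RT-PCR,
# detection probability 0.6 (from the slide).
f_segregated = implied_frequency(0.6, 10)

print(f"implied frequency in pooled reaction:   {f_pooled:.5f}")
print(f"implied frequency in segregated RT-PCR: {f_segregated:.5f}")
```

Under this simplified reading, segregation helps because the rare form makes up a much larger share of its own RT-PCR product than of the pooled reaction, so fewer clones are needed to find it.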
42
RT-PCR cloning and sequencing pilot (Kourosh
Salehi-Ashtiani, DFCI)

24 novel RACEfrags tested by RT-PCR, including
6 cases previously confirmed in Denoeud et al.
(2007)
Positive RT-PCR products cloned, and 32 randomly
selected clones sequenced.

RESULTS
14 positive RT-PCRs, 13 confirmed by sequencing.
42 novel transcript variants, compared with the
52 previously known for the RT-PCR positive loci.
Nearly all splice boundaries canonical
Genomic extensions from 2.5 to 145 kb

Difficulties in obtaining sequences for long
cDNAs (which correlate with long genomic
extensions), but even with previously verified
cases. Problems with RACEfrag assignment to
loci
43
A very efficient strategy for targeted large
scale transcript discovery

RACEarray normalization: 448 attempted clone
sequences -> 42 novel transcripts
1 novel transcript per 10 clones sequenced.
Carninci et al. (Genome Research 2003,
13:1273-1289): 1,989,385 ESTs -> 70,214 transcripts
(mouse)
1 transcript per 30 sequenced ESTs (and the
majority of transcripts already known)
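
The two yield figures follow from simple ratios. The short check below uses the counts quoted on this slide and the previous one (14 positive RT-PCRs with 32 clones each); the variable names are illustrative.

```python
# RACEarray pilot: 14 positive RT-PCRs, 32 clones sequenced from each.
positive_rt_pcrs = 14
clones_per_rt_pcr = 32
attempted_clones = positive_rt_pcrs * clones_per_rt_pcr  # 448
novel_transcripts = 42

# EST-based discovery in mouse, as quoted from Carninci et al.
ests_sequenced = 1_989_385
transcripts_found = 70_214

clones_per_novel_transcript = attempted_clones / novel_transcripts
ests_per_transcript = ests_sequenced / transcripts_found

print(f"clones per novel transcript: {clones_per_novel_transcript:.1f}")
print(f"ESTs per transcript:         {ests_per_transcript:.1f}")
```

The exact ratios are about 10.7 and 28.3, which the slide rounds to 10 and 30.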

44
CONCLUSIONS

there is a substantial amount of transcription
which does not appear to be associated with protein
coding loci
only a fraction of the transcript diversity of
protein coding loci appears to have been surveyed
so far.
in particular, protein coding loci appear to have
tissue-specific distal alternative
transcriptional start sites
RACEarrays are an effective normalization
strategy for identification of rare transcripts
ENCODE transcriptional landscape: a network of
overlapping coding and non-coding transcripts,
resulting in a continuum of transcription (more
than 90% of the ENCODE regions are transcribed in
at least one strand)

45
PROVING THE FUNCTIONALITY OF NOVEL TRANSCRIPTS
46
The GENCODE annotation

487 loci. 2608 transcripts
53 non-coding loci. 137 transcripts
434 protein coding loci.
1097 coding transcripts
1374 non-coding transcripts
5.7 transcripts per protein coding locus
2.5 coding transcripts per locus
1.7 proteins per locus
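
The per-locus figures can be rederived from the counts on this slide. The sketch below does so as a consistency check; the count of 738 distinct proteins is taken from the following slide, and the variable names are illustrative.

```python
# Counts from the GENCODE annotation slide.
non_coding_loci = 53
protein_coding_loci = 434
transcripts_in_non_coding_loci = 137
coding_transcripts = 1097
non_coding_transcripts_in_coding_loci = 1374

# Count of distinct GENCODE proteins, quoted on the following slide.
gencode_proteins = 738

total_loci = non_coding_loci + protein_coding_loci
total_transcripts = (
    transcripts_in_non_coding_loci
    + coding_transcripts
    + non_coding_transcripts_in_coding_loci
)

transcripts_per_coding_locus = (
    coding_transcripts + non_coding_transcripts_in_coding_loci
) / protein_coding_loci
coding_transcripts_per_locus = coding_transcripts / protein_coding_loci
proteins_per_locus = gencode_proteins / protein_coding_loci

print(total_loci, total_transcripts)
print(
    round(transcripts_per_coding_locus, 1),
    round(coding_transcripts_per_locus, 1),
    round(proteins_per_locus, 1),
)
```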

47
the combined analysis of BioSapiens, Kellis and
Goldman identified 184 annotated protein coding
transcripts which challenged (from the
structural, functional and/or evolutionary
standpoint) our current view of
proteins. Footnote: removing these 184 proteins
from the set of 738 GENCODE proteins will leave
554 proteins for 434 loci, barely 1.3 proteins
per locus
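
The footnote arithmetic, spelled out as a quick check using only the numbers quoted above:

```python
gencode_proteins = 738
challenged_proteins = 184
protein_coding_loci = 434

remaining_proteins = gencode_proteins - challenged_proteins
remaining_per_locus = remaining_proteins / protein_coding_loci

print(remaining_proteins, round(remaining_per_locus, 1))
```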
48
Locus RP11-298J23.1 codes for pepsinogen C. The
structure of pepsinogen C is 1htrA. Isoform -003
is missing 80 residues with respect to pepsinogen
C. Here the missing section of -003 is in light
green. The missing section in this isoform would
remove the core from both subdomains of the
structure. Both the N-terminal sub-domain (on the
left) and the C-terminal sub-domain would have to
refold. This is the view from above looking down
into the active cleft of the proteinase. Active
site aspartates are shown in ball and chain. One
of the two active site residues is in the missing
section. The symmetry apparent in this isoform
suggests that although it will have to refold it
may very well be able to reform into a single
subdomain.
Michael Tress, Alfonso Valencia (CNB, Madrid)
49
Expression levels: alternative vs constitutive

Q-PCR in three cell lines
SKNAS
GM06990
HelaS3

50
Polysomal association: alternative vs constitutive
51
ACKNOWLEDGEMENTS
ENCODE GT GROUP
Stylianos Antonarakis (Geneva), Robert Baertsch (UCSC), Ian Bell (Affx),
Ewan Birney (EBI), Robert Castelo (IMIM), Jill Cheng (Affx),
Evelyn Cheung (Affx), Hiram Clawson (UCSC), France Denoeud (IMIM),
Sam Deutsch (Geneva), Sujit Dike (Affymetrix), Jorg Drenkow (Affymetrix),
Olof Emanuelsson (Yale), Paul Flicek (Sanger), Mark Gerstein (Yale),
Srinka Ghosh (Affx), Jenn Harrow (Sanger), Greg Helt (Affx),
Ivo Hofacker (U. Vienna), Tim Hubbard (Sanger), Phil Kapranov (Affx),
Damian Keefe (EBI), Jan Korbel (Yale), Julien Lagarde (IMIM),
Jeff Long (Affx), Todd Lowe (UCSC), G. Madhavan (Affx),
Anton Nekrutenko (Penn State), David Nix (Affx), Jakob Pedersen (UCSC),
Alex Reymond (Geneva), Joel Rozowsky (Yale), Yijun Ruan (GIS),
Albin Sandelin (RIKEN), Mike Snyder (Yale), Peter F. Stadler (U. Vienna),
Kevin Struhl (Harvard), Hari Tammana (Affx), Scott Tenenbaum (SUNY, Albany),
Chia Lin Wei (GIS), Matt Weirauch (UCSC), Deyou Zheng (Yale)
Adam Frankish (Sanger), Tom Gingeras (Affymetrix), Roderic Guigó (CRG)

52
(No Transcript)
53
CENTER FOR GENOMIC REGULATION, PRBB, BARCELONA
