The complexity of the transcriptional landscape of the human genome, Roderic Guigó

1
The complexity of the transcriptional landscape
of the human genome
Roderic Guigó
Center for Genomic Regulation, Barcelona
2
genes and proteins
  • One gene, one enzyme: Beadle and Tatum
  • The Central Dogma: Francis Crick

3
The standard model of the eukaryotic gene
Most of the transcriptional output of the human
genome is localized in well-defined genomic loci,
which encode mRNAs that, when exported into the
cytosol, are translated into proteins.
4
(No Transcript)
5
(No Transcript)
6
  • 1% of the genome, 44 regions
  • target selection: a committee selected the
    sequence targets
  • manual targets: regions with a lot of existing
    information
  • random targets: stratified by non-exonic
    conservation with mouse and by gene density

7
DNase Hypersensitive Sites
DNA Replication
Epigenetic ?
Genes and Transcripts
Cis-regulatory elements (promoters, transcription
factor binding sites)
Long-range regulatory elements (enhancers,
repressors/silencers, insulators)
8
(No Transcript)
9
GENCODE: encyclopedia of genes and gene variants
  • identify all protein coding genes in the ENCODE
    regions
  • identify one complete mRNA sequence for at least
    one splice isoform of each protein coding gene
  • eventually, identify a number of additional
    alternative splice forms
  • Roderic Guigó, IMIM-UPF-CRG
  • Stylianos Antonarakis, Alexandre Reymond, Geneva
  • Ewan Birney, EBI
  • Michael Brent, WashU
  • Lior Pachter, Berkeley
  • Manolis Dermitzakis, Sanger
  • Jennifer Ashurst, Tim Hubbard

10
the gencode pipeline
  1. mapping of known transcript sequences (ESTs,
    cDNAs, proteins) onto the human genome
  2. manual curation to resolve conflicting evidence
  3. additional computational predictions
  4. experimental verification
  5. FINAL ANNOTATION

12
The GENCODE pipeline
manual curation: HAVANA (Sanger)
experimental verification: Geneva
bioinformatics: IMIM
  • 2608 transcripts in 487 loci
  • 137 transcripts in 53 non-coding loci
  • 1097 coding transcripts and 1374 non-coding
    transcripts in 434 protein coding loci
  • most protein coding loci encode a mixture of
    protein coding and non-coding transcripts

13
one gene - many proteins
very complex transcription units
14
Distribution of DNaseI HSs vs. TSS in Different
Gene Annotation Sets
from the ENCODE Chromatin and Replication Group,
John Stamatoyannopoulos
15
EGASP05
  • the complete annotation of 13 regions was
    released on January 30, 2005.
  • the annotation of the remaining 31 regions was
    still being produced, and it was withheld.
  • gene prediction groups were asked to submit
    predictions for the remaining 31 regions by
    April 15, 2005.
  • 18 groups participated, submitting 30 prediction
    sets
  • predictions were compared to the annotations in an
    NHGRI sponsored workshop at the Wellcome Trust
    Sanger Institute, on May 6 and 7, 2005.

16
EGASP05
  • two main goals
  • to assess how well automatic methods are able to
    reproduce the (costly)
    manual/computational/experimental GENCODE
    annotation
  • to assess how complete the GENCODE annotation
    is: are there still genes consistently predicted
    by computational methods?

17
accuracy measures
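The transcript keeps only the title of this slide. As a rough illustration, gene prediction assessments commonly report exon-level sensitivity (the fraction of annotated exons that are predicted exactly) and specificity (the fraction of predicted exons that match an annotated exon exactly). The sketch below assumes exons are given as (start, end) tuples on one sequence and strand, and that only identical coordinates count as a match. The function name and the toy coordinates are hypothetical and are not taken from EGASP.

```python
def exon_level_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) based on exact exon matches.

    annotated, predicted: iterables of (start, end) tuples, assumed to lie
    on the same sequence and strand (an assumption of this sketch).
    """
    annotated = set(annotated)
    predicted = set(predicted)
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy example with made-up coordinates (not real annotation data).
annotated_exons = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted_exons = [(100, 200), (300, 400), (510, 650), (800, 900), (1000, 1100)]

sn, sp = exon_level_accuracy(annotated_exons, predicted_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In this toy case three of the four annotated exons are recovered exactly, and three of the five predictions are correct; the exon predicted at (510, 650) counts as wrong because one boundary differs.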
18
accuracy at the coding exon level
evidence-based, dual-genome, ab initio
19
EGASP05
Programs are quite good at calling the protein
coding exons (accuracy at 80%; not as good at
calling the transcribed exons), but the best of
the programs correctly predict only 40% of the
complete transcripts (considering only the coding
fraction).
20
EGASP05
many novel exons predicted
  • 8,634 unique exons predicted in intergenic
    regions
  • we ranked the exons according to the accuracy of
    the predicting programs
  • tested 238 exon pairs by RT-PCR in 24 tissues
  • only 7 (less than 3%) were confirmed positive
21
(No Transcript)
22
DNase Hypersensitive Sites
DNA Replication
Epigenetic ?
Genes and Transcripts
Cis-regulatory elements (promoters, transcription
factor binding sites)
Long-range regulatory elements (enhancers,
repressors/silencers, insulators)
23
Genome tiling arrays
Slide from http://signal.salk.edu/msample.html
Salk Institute Genomic Analysis Laboratory
24
TRANSCRIPTION OF PROCESSED POLY A RNA based on
a number of high-throughput technologies
Total number of nucleotides: 29,998,060
Non-repeat-masked nucleotides: 14,707,189

                                 Nucleotides covered   % nucleotides covered
Annotated exons                  1,650,821             9.8
Transfrag/TAR                    1,278,588             9.3
CAGE tags                        151,149               0.5
Ditags                           24,939                0.1
TOTAL UNIQUE transcribed bases   2,355,238             14.7
25
Table 1. Summary of Transcriptional Coverage of
ENCODE Regions.
Values are bases, with percentages in parentheses.

                                      TOTAL (interrogated     INTERROGATED
                                      and uninterrogated)
 1. Total bases                       29998060                29998060
 2. Total interrogated bases          14707189 (49)           14707189 (49)
PROCESSED TRANSCRIPTS (PT)
 3. bp in exons                       1776157 (5.9)           1447192 (9.8)
 4. bp in CAGE tags                   151149 (0.5)            116013 (0.8)
 5. bp in PET                         24939 (0.1)             19629 (0.1)
 6. bp in TF                          1369611 (4.6)           1369304 (9.3)
 7. Total bases in PT                 2519280 (8.4)           2163303 (14.7)
 8. Bases in PT (ESTs included)       4826292 (16.1)          3545358 (24.1)
PRIMARY TRANSCRIPTS
 9. Bases in exons and introns        17758738 (59.2)         9496360 (64.6)
10. Bases with 5' RACE                23318182 (77.7)         11763410 (80.0)
11. Bases between PETs                19658563 (65.5)         9767311 (66.4)
12. Total bases                       27325931 (91.1)         13618240 (92.6)
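The percentages in the INTERROGATED row of Table 1 can be rechecked from the base counts. The short sketch below assumes that each percentage is the base count divided by the 14,707,189 interrogated bases; the dictionary keys are shorthand labels chosen here, not names from the source.

```python
# Base counts copied from the INTERROGATED row of Table 1.
INTERROGATED_BASES = 14_707_189

interrogated_counts = {
    "bp in exons": 1_447_192,
    "bp in CAGE tags": 116_013,
    "bp in PET": 19_629,
    "bp in TF": 1_369_304,
    "total bases in PT": 2_163_303,
    "bases in PT (ESTs included)": 3_545_358,
    "bases in exons and introns": 9_496_360,
    "bases with 5' RACE": 11_763_410,
    "bases between PETs": 9_767_311,
    "total bases (primary transcripts)": 13_618_240,
}


def percent_covered(bases, total=INTERROGATED_BASES):
    """Percentage of `total` covered by `bases`, rounded to one decimal."""
    return round(100 * bases / total, 1)


percentages = {name: percent_covered(n) for name, n in interrogated_counts.items()}
for name, pct in percentages.items():
    print(f"{name:36s} {pct:5.1f}")
```

Under that assumption the recomputed values agree with the percentages printed in the table, for example 9.8 for exons and 92.6 for the primary transcript total.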
27
Other recent studies
  • Many individual studies suggest unanticipated
    complexity of the transcriptional map of the
    human genome
  • Kapranov et al. (2007): RNA onto tiling arrays,
    novel RNA classes, hundreds of thousands of novel
    sites of transcription
  • Peters et al. (2007): LongSAGE, evidence for
    thousands of novel transcripts
  • Roma et al. (2007): gene trap sequence tags in
    mouse embryonic stem cells, thousands of novel
    transcripts
  • Unneberg and Claverie (2007): interchromosomal
    transcript chimerism
  • Denoeud et al. (2007): RACEarrays, doubling the
    number of annotated exons in protein coding
    transcripts, widespread transcript chimerism

28
tiling arrays reveal many novel sites of
transcription
TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME
COURSE (data by Tom Gingeras, Affymetrix)
29
characteristics of unannotated transfrags
  • short: 78 bp on average, compared with 121 bp
    for exonic transfrags
  • very GC-rich: 56% vs 42% in the background of
    unannotated regions
  • lack splice sites
  • no matches to protein or domain databases
  • lack of selective constraints
  • HOWEVER
  • reproducible across cell lines
  • supported by independent evidence of
    transcription (mostly unspliced ESTs)
  • enriched for RNA structures

30
Other recent studies
  • Many individual studies suggest unanticipated
    complexity of the transcriptional map of the
    human genome
  • Kapranov et al. (in press): RNA onto tiling
    arrays, novel RNA classes, hundreds of thousands
    of novel sites of transcription
  • Peters et al. (2007): LongSAGE, evidence for
    thousands of novel transcripts
  • Roma et al. (in press): gene trap sequence tags
    in mouse embryonic stem cells, thousands of novel
    transcripts
  • Unneberg and Claverie (2007): interchromosomal
    transcript chimerism
  • Denoeud et al. (in press): RACEarrays, doubling
    the number of annotated exons in protein coding
    transcripts, widespread transcript chimerism

31
Rozowsky et al., 2007
  • Novel TARs/transfrags are associated with known
    genes by identifying those that are co-expressed
    with known genes across 11 cell lines and
    conditions
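One simple way to picture the co-expression idea is to correlate the signal of a novel TAR with the expression profile of each known gene across the same set of samples. The sketch below uses Pearson correlation as an assumed measure; the transcript does not say which statistic or threshold Rozowsky et al. used, and the profiles and gene names are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_coexpressed_gene(tar_profile, gene_profiles):
    """Return (gene_name, correlation) for the most correlated known gene."""
    scores = {name: pearson(tar_profile, prof) for name, prof in gene_profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical signal values over 11 cell lines and conditions.
novel_tar = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
known_genes = {
    "geneA": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22],  # scaled copy of the TAR profile
    "geneB": [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],       # reversed profile
}

gene, r = best_coexpressed_gene(novel_tar, known_genes)
print(gene, round(r, 3))
```

A real analysis would also need a significance threshold and some handling of genomic distance between the TAR and the gene, neither of which is specified in the transcript.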

32
Rozowsky et al., 2007
33
Denoeud et al., 2007
  • The ENCODE experiments
  • 5' RACE on 12 tissues
  • primers in internal exons of 399 protein coding
    loci
  • RACE products hybridized onto genome tiling
    arrays
  • 4573 RACE exons detected, 2324 of them novel

34
5' RACE from the TMEM15 gene (region ENr232)
identifies several tissue-specific distal 5'
exons.
Target gene
35
(No Transcript)
36
more than 30% of RACEfrags are more than 3Mb away
from the index exon
37
distal RACEfrags are associated with independently
predicted sites of transcription initiation
38
cloning and sequencing of RACEarray products
39
cloning and sequencing of RACEarray products
almost 30% of the sequenced products incorporate
exons from upstream genes in chimeric structures
40
RACEarrays: a strategy for normalization of RACE
libraries and exhaustive identification of
alternative transcripts
41
Array based normalization of RACE libraries
If we select 40 clones at random from the RACE
reaction, the probability of selecting a clone
from the less abundant form is 0.01 (assuming a
multinomial distribution). However, if the
transcript forms could be segregated by RT-PCR,
then by again selecting 40 random clones, 10 from
each RT-PCR, the probability of selecting the
less abundant form is now 0.6.
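The relative abundances behind these two probabilities were on the slide figure and are not in the transcript. As a hedged sketch of the reasoning, assume clones are picked independently and that the rare form reaches the picked clones from a single reaction. Then the chance of seeing a form of frequency f at least once in n clones is 1 - (1 - f)^n, and the relation can be inverted to ask what frequency each quoted probability would imply. The function names are hypothetical.

```python
def prob_at_least_one(freq, n_clones):
    """P(at least one of n_clones independently picked clones is the given form)."""
    return 1.0 - (1.0 - freq) ** n_clones


def implied_frequency(prob, n_clones):
    """Frequency of a form that gives probability `prob` of being seen in n_clones picks."""
    return 1.0 - (1.0 - prob) ** (1.0 / n_clones)


# Frequencies implied by the slide's numbers under the assumptions above.
pooled_freq = implied_frequency(0.01, 40)      # 40 clones from the pooled RACE reaction
segregated_freq = implied_frequency(0.6, 10)   # 10 clones from one segregated RT-PCR

enrichment = segregated_freq / pooled_freq
print(f"pooled frequency:     {pooled_freq:.6f}")
print(f"segregated frequency: {segregated_freq:.6f}")
print(f"fold enrichment:      {enrichment:.0f}")
```

Read this way, segregating the forms raises the effective frequency of the rare form among the sequenced clones by more than two orders of magnitude, which is the point of the normalization.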
42
RT-PCR cloning and sequencing pilot (Kourosh
Salehi-Ashtiani, DFCI)
  • 24 novel RACEfrags tested by RT-PCR, including
    6 cases previously confirmed in Denoeud et al.
    (2007)
  • Positive RT-PCR products cloned, and 32 randomly
    selected clones sequenced.
  • RESULTS
  • 14 positive RT-PCRs, 13 confirmed by sequencing.
  • 42 novel transcript variants, compared with the
    52 previously known for the RT-PCR positive loci.
  • Nearly all canonical splice boundaries
  • Genomic extensions from 2.5 to 145 Kb
Difficulties in obtaining sequences for long
cDNAs (which correlate with long genomic
extensions), even for previously verified
cases. Problems with the assignment of RACEfrags
to loci.
43
A very efficient strategy for targeted large
scale transcript discovery
  • RACEarray normalization: 448 attempted clone
    sequences → 42 novel transcripts
  • 1 novel transcript per 10 clones sequenced.
  • Carninci et al. (Genome Research 2003,
    13:1273-1289): 1,989,385 ESTs → 70,214 transcripts
    (mouse)
  • 1 transcript per 30 ESTs sequenced (and the
    majority of transcripts already known)
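The two yields quoted above follow directly from the counts on the slide. A minimal check, with a hypothetical helper name:

```python
def sequences_per_transcript(sequenced, transcripts):
    """Average number of sequenced clones or ESTs needed per transcript found."""
    return sequenced / transcripts


# RACEarray normalization: 448 attempted clone sequences, 42 novel transcripts.
racearray_cost = sequences_per_transcript(448, 42)

# Carninci et al. 2003 (mouse): 1,989,385 ESTs, 70,214 transcripts.
est_cost = sequences_per_transcript(1_989_385, 70_214)

print(f"RACEarray: 1 novel transcript per {racearray_cost:.1f} clones")
print(f"ESTs:      1 transcript per {est_cost:.1f} ESTs")
```

The comparison is not quite like for like, since the RACEarray count covers novel transcripts only, while most of the EST-derived transcripts were already known.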

44
CONCLUSIONS
  • there is a substantial amount of transcription
    which does not appear to be associated with
    protein coding loci
  • only a fraction of the transcript diversity of
    protein coding loci appears to have been surveyed
    so far.
  • in particular, protein coding loci appear to have
    tissue specific distal alternative
    transcriptional start sites
  • RACEarrays are an effective normalization
    strategy for identification of rare transcripts
  • ENCODE transcriptional landscape: a network of
    overlapping coding and non-coding transcripts,
    resulting in a continuum of transcription (more
    than 90% of the ENCODE regions are transcribed on
    at least one strand)

45
PROVING THE FUNCTIONALITY OF NOVEL TRANSCRIPTS
46
The GENCODE annotation
  • 487 loci. 2608 transcripts
  • 53 non-coding loci. 137 transcripts
  • 434 protein coding loci.
  • 1097 coding transcripts
  • 1374 non-coding transcripts
  • 5.7 transcripts per protein coding locus
  • 2.5 coding transcripts per locus
  • 1.7 proteins per locus

47
The combined analysis of BioSapiens, Kellis and
Goldman identified 184 annotated protein coding
transcripts which challenged (from the
structural, functional and/or evolutionary
standpoint) our current view of proteins.
Footnote: removing these 184 proteins from the
set of 738 GENCODE proteins will leave 554
proteins for 434 loci, barely 1.3 proteins per
locus.
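The per-locus figures quoted for the GENCODE annotation can be reproduced from the counts given on the slides. The variable names below are chosen for this sketch only.

```python
# Counts quoted on the GENCODE annotation slides.
protein_coding_loci = 434
noncoding_loci = 53
coding_transcripts = 1097
noncoding_transcripts_in_coding_loci = 1374
transcripts_in_noncoding_loci = 137
gencode_proteins = 738
challenged_proteins = 184

# Totals: these should give 487 loci and 2608 transcripts.
total_loci = protein_coding_loci + noncoding_loci
total_transcripts = (coding_transcripts
                     + noncoding_transcripts_in_coding_loci
                     + transcripts_in_noncoding_loci)

# Per-locus ratios for the protein coding loci.
transcripts_per_locus = (coding_transcripts
                         + noncoding_transcripts_in_coding_loci) / protein_coding_loci
coding_per_locus = coding_transcripts / protein_coding_loci
proteins_per_locus = gencode_proteins / protein_coding_loci

# Effect of removing the 184 challenged proteins.
remaining_proteins = gencode_proteins - challenged_proteins
remaining_per_locus = remaining_proteins / protein_coding_loci

print(total_loci, total_transcripts)
print(round(transcripts_per_locus, 1), round(coding_per_locus, 1),
      round(proteins_per_locus, 1), round(remaining_per_locus, 1))
```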
48
Locus RP11-298J23.1 codes for pepsinogen C. The
structure of pepsinogen C is 1htrA. Isoform -003
is missing 80 residues with respect to pepsinogen
C. Here the missing section of -003 is in light
green. The missing section in this isoform would
remove the core from both subdomains of the
structure. Both the N-terminal sub-domain (on the
left) and the C-terminal sub-domain would have to
refold. This is the view from above looking down
into the active cleft of the proteinase. Active
site aspartates are shown in ball and chain. One
of the two active site residues is in the missing
section. The symmetry apparent in this isoform
suggests that although it will have to refold it
may very well be able to reform into a single
subdomain.
Michael Tress, Alfonso Valencia (CNB, Madrid)
49
Expression levels: alternative vs constitutive
  • Q-PCR in three cell lines
  • SKNAS
  • GM06990
  • HelaS3

50
Polysomal association: alternative vs constitutive
51
ACKNOWLEDGEMENTS
ENCODE GT GROUP
Stylianos Antonarakis (Geneva), Robert Baertsch
(UCSC), Ian Bell (Affx), Ewan Birney (EBI), Robert
Castelo (IMIM), Jill Cheng (Affx), Evelyn Cheung
(Affx), Hiram Clawson (UCSC), France Denoeud
(IMIM), Sam Deutsch (Geneva), Sujit Dike
(Affymetrix), Jorg Drenkow (Affymetrix), Olof
Emanuelsson (Yale), Paul Flicek (Sanger), Mark
Gerstein (Yale), Srinka Ghosh (Affx), Jenn Harrow
(Sanger), Greg Helt (Affx), Ivo Hofacker (U.
Vienna), Tim Hubbard (Sanger), Phil Kapranov
(Affx), Damian Keefe (EBI), Jan Korbel (Yale),
Julien Lagarde (IMIM), Jeff Long (Affx), Todd Lowe
(UCSC), G. Madhavan (Affx), Anton Nekrutenko (Penn
State), David Nix (Affx), Jakob Pedersen (UCSC),
Alex Reymond (Geneva), Joel Rozowsky (Yale), Yijun
Ruan (GIS), Albin Sandelin (RIKEN), Mike Snyder
(Yale), Peter F. Stadler (U. Vienna), Kevin Struhl
(Harvard), Hari Tammana (Affx), Scott Tenenbaum
(SUNY, Albany), Chia Lin Wei (GIS), Matt Weirauch
(UCSC), Deyou Zheng (Yale)
Adam Frankish (Sanger), Tom Gingeras (Affymetrix),
Roderic Guigó (CRG)



52
(No Transcript)
53
CENTER FOR GENOMIC REGULATION, PRBB, BARCELONA