Alternative Splicing from ESTs

1
Alternative Splicing from ESTs
  • Eduardo Eyras
  • Bioinformatics UPF February 2004

2
  • Intro
  • ESTs
  • Prediction of Alternative Splicing from ESTs

3
[Figure: DNA strands labelled 5' and 3', and an mRNA with a 5' cap and a poly(A) tail (AAAAAAA)]
4
[Figure: transcription of a gene (exons and introns, strands labelled 5' and 3') into a pre-mRNA with a 5' cap and a poly(A) tail (AAAAAAA)]
5
Alt splicing as a mechanism of gene regulation
  • Functional domains can be added/subtracted → protein diversity
  • Can introduce early stop codons, resulting in truncated proteins or
    unstable mRNAs
  • It can modify the activity of transcription factors, affecting the
    expression of genes
  • It is observed in nearly all metazoans
  • Estimated to occur in 30-40% of human genes
6
Forms of alternative splicing
  • Exon skipping / inclusion
  • Alternative 3' splice site
  • Alternative 5' splice site
  • Mutually exclusive exons
  • Intron retention
[Figure legend: constitutive exon; alternatively spliced exons]
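As a rough illustration of the first form listed above, exon skipping can be spotted by comparing the exon coordinates of two splice forms of the same gene. This is a minimal sketch, assuming exons are given as (start, end) pairs on the same strand; the function names are hypothetical and do not come from the presentation.

```python
def introns(exons):
    """Introns implied by a sorted list of (start, end) exons."""
    return [(a[1], b[0]) for a, b in zip(exons, exons[1:])]


def skipped_exons(long_form, short_form):
    """Exons of long_form that lie inside an intron of short_form
    whose donor and acceptor sites are also used by long_form."""
    donors = {d for d, _ in introns(long_form)}
    acceptors = {a for _, a in introns(long_form)}
    skipped = []
    for donor, acceptor in introns(short_form):
        if donor in donors and acceptor in acceptors:
            skipped += [e for e in long_form
                        if donor < e[0] and e[1] < acceptor]
    return skipped


long_form = [(100, 200), (300, 400), (500, 600)]
short_form = [(100, 200), (500, 600)]
print(skipped_exons(long_form, short_form))
```

The same comparison would also flag each exon of a mutually exclusive pair, so a fuller classifier would need further checks.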
7
  • How to study alternative splicing?

8
ESTs (Expressed Sequence Tags)
  • Single-pass sequencing of a small (end) piece of cDNA
  • Typically 200-500 nucleotides long
  • It may contain coding and/or non-coding region
9
ESTs
  • Cells from a specific organ, tissue or developmental stage
  • mRNA extraction
  • Add oligo-dT primer (TTTTTT)
  • Reverse transcriptase copies the RNA into DNA
  • Ribonuclease H
  • DNA polymerase, Ribonuclease H
  • Double stranded cDNA (AAAAAA / TTTTTT)
10
ESTs
  • Clone cDNA into a vector
  • Multiple cDNA clones
  • Single-pass sequence reads: 5' EST and 3' EST
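To make the two read types concrete, the sketch below takes a cDNA sequence (written 5' to 3' in the mRNA sense) and returns a 5' EST and a 3' EST. The 3' read comes off the opposite strand, so it is the reverse complement of the clone end. The function names and the default read length of 400 are assumptions for illustration only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def est_reads(cdna, read_len=400):
    """Return (5' EST, 3' EST) as single reads from each end of the clone."""
    five_prime = cdna[:read_len]
    three_prime = reverse_complement(cdna[-read_len:])
    return five_prime, three_prime


five, three = est_reads("ATGGCCTTGAAAAAA", read_len=5)
print(five, three)
```

A 3' EST from a poly(A)-tailed clone therefore starts with a run of T.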
11
Alternative Splicing from ESTs
[Figure: genomic sequence → primary transcript → splicing → splice variants → cDNA clones → EST sequences (5' and 3' reads)]
12
Alternative Splicing from ESTs
ESTs can also provide information about potential
alternative splicing when aligned to the genome
(and when aligned to mRNA data)
13
EST sequencing
  • Is fast and cheap
  • Gives direct information about the gene sequence
  • Partial information

Resulting ESTs
  • Known gene (DB searches)
  • Similar to known gene
  • Contaminant
  • Novel gene
14
ESTs provide expression data
eVOC Ontologies: http://www.sanbi.ac.za/evoc/
15
Linking the expression vocabulary to gene
annotations
ESTs
Genes
16
Normalized vs. non-normalized libraries
17
The down side of the ESTs
  • Cannot detect lowly/rarely expressed genes or
    non-expressed sequences (regulatory)

Random sampling: the more ESTs we sequence, the
fewer new useful sequences we get
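The diminishing return can be illustrated with a simple sampling model. This is a sketch under stated assumptions: each EST is drawn independently, a gene is picked with probability proportional to its transcript abundance, and the abundances used here are made up. Under that model the expected number of distinct genes seen after n ESTs is the sum over genes of 1 - (1 - p)^n.

```python
def expected_distinct(abundances, n_reads):
    """Expected number of distinct genes seen after n_reads random ESTs."""
    total = sum(abundances)
    return sum(1 - (1 - a / total) ** n_reads for a in abundances)


# Hypothetical library: one high, one medium and one rare transcript.
abundances = [100, 10, 1]
for n in (10, 100, 200, 400):
    print(n, round(expected_distinct(abundances, n), 3))
```

Each extra batch of reads adds less than the previous one, and the rare transcript is the last to be sampled.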
18
Gene Hunting
  • Sequencing of the Human Genome (HGP)
  • EST Sequencing
19
Origin of the ESTs
  • Science. 1991 Jun 21;252(5013):1651-6
  • Complementary DNA sequencing: expressed sequence
    tags and human genome project.
  • Adams MD, Kelley JM, Gocayne JD, Dubnick M,
    Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde
    B, Moreno RF, et al.
  • Section of Receptor Biochemistry and Molecular
    Biology, National Institute of Neurological
    Disorders and Stroke, National Institutes of
    Health, Bethesda, MD.

Automated partial DNA sequencing was conducted
on more than 600 randomly selected human brain
complementary DNA (cDNA) clones to generate
expressed sequence tags (ESTs). ESTs have
applications in the discovery of new human genes,
mapping of the human genome, and identification
of coding regions in genomic sequences. Of the
sequences generated, 337 represent new genes,
including 48 with significant similarity to genes
from other organisms, such as a yeast RNA
polymerase II subunit; Drosophila kinesin, Notch,
and Enhancer of split; and a murine tyrosine
kinase receptor. Forty-six ESTs were mapped to
chromosomes after amplification by the polymerase
chain reaction. This fast approach to cDNA
characterization will facilitate the tagging of
most human genes in a few years at a fraction of
the cost of complete genomic sequencing, provide
new genetic markers, and serve as a resource in
diverse biological research fields.
20
EST-sequencing explosion
  • → non-exclusivity (1992)
  • Merck and WashU (1994)
  • → public ESTs
  • → GenBank
  • → dbEST

21
dbEST release 20 February 2004
  • Number of public entries: 20,039,613
  • Summary by organism:
  • Homo sapiens (human): 5,472,005
  • Mus musculus domesticus (mouse): 4,056,481
  • Rattus sp. (rat): 583,841
  • Triticum aestivum (wheat): 549,926
  • Ciona intestinalis: 492,511
  • Gallus gallus (chicken): 460,385
  • Danio rerio (zebrafish): 450,652
  • Zea mays (maize): 391,417
  • Xenopus laevis (African clawed frog): 359,901

22
EST lengths
[Figure: human EST length distribution (dbEST, Sep. 2003), annotated at 450 bp]
23
Recover the mRNA from the ESTs
24
What is an EST cluster?
A cluster is a set of fragmented EST data (plus
mRNA data if known), consolidated according to
sequence similarity. Clusters are indexed by
gene such that all expressed data concerning a
single gene is in a single index class, and each
index class contains the information for only one
gene. (Burke, Davison, Hide, Genome Research
1999).
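One simple way to picture "consolidated according to sequence similarity" is single-linkage grouping: any two ESTs reported as similar end up in the same cluster. The sketch below does this with a union-find structure. It is only an illustration under that assumption, not the procedure used by any particular clustering resource, and the EST identifiers and similarity pairs are made up.

```python
def cluster(ids, similar_pairs):
    """Group ids into clusters given pairs judged similar (single linkage)."""
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the cluster representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


print(cluster(["e1", "e2", "e3", "e4"], [("e1", "e2"), ("e2", "e3")]))
```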
25
EST pre-processing
  • Vector
  • Repeats
  • Mitochondrial
  • Xenocontaminants
26
EST Clustering
  • UniGene (NCBI): www.ncbi.nlm.nih.gov/UniGene
  • TIGR Human Gene Index (The Institute for Genomic Research): www.tigr.org
  • StackDB (South African National Bioinformatics Institute): www.sanbi.ac.za

27
UniGene
  • Species: UniGene entries
  • Homo sapiens: 118,517
  • Mus musculus: 82,482
  • Rattus norvegicus: 43,942
  • Sus scrofa: 20,426
  • Gallus gallus: 11,970
  • Xenopus laevis: 21,734
  • Xenopus tropicalis: 17,102

28
  • ESTs and the Genome

29
ESTs aligned to the genome
  • Some advantages
  • It defines the location of exons and introns
  • We can verify the splice sites of introns (e.g.
    GT-AG)
  • → hence also check the correct strand of spliced
    ESTs
  • It helps prevent chimeras
  • It can avoid putting together ESTs from
    paralogous genes
  • We can prevent including pseudogenes in our
    analysis

30
Aligning ESTs to the Genome
  • Many ESTs → fast programs, fast computers
  • Nearly exact matches: coverage > 97%,
    percent_id > 97%
  • Splice sites: GT-AG, AT-AC, GC-AG
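
The thresholds and splice-site pairs above can be sketched as a small filter. This is a minimal sketch, assuming 0-based half-open intron coordinates on the forward genomic strand; the function names are hypothetical. An intron that shows the reverse complement of an accepted pair is taken to lie on the reverse strand, which is how the splice sites also give the strand of a spliced EST.

```python
ACCEPTED = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def intron_strand(genome, start, end):
    """Return '+', '-' or None for the intron genome[start:end]."""
    first2, last2 = genome[start:start + 2], genome[end - 2:end]
    if (first2, last2) in ACCEPTED:
        return "+"
    if (revcomp(last2), revcomp(first2)) in ACCEPTED:
        return "-"
    return None


def passes_filter(coverage, percent_id, threshold=97.0):
    """Nearly exact match: both values above the threshold (in percent)."""
    return coverage > threshold and percent_id > threshold


print(intron_strand("AAAGTCCCCAGTTT", 3, 11), passes_filter(99.0, 98.5))
```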

31
Aligning ESTs to the Genome
Extra pre-processing of ESTs
  • Clip poly-A tails / clip 20 bp from either end
  • Keep the best alignment in the genome
  • Remove potential processed pseudogenes
  • Give preference to ESTs that are spliced
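
The clipping step can be sketched as follows. This is only an illustration: it assumes both clips are applied in turn, that a tail is a run of at least 6 A at the 3' end (a made-up threshold), and that sequences are upper-case strings. The function name and parameters are hypothetical.

```python
import re


def clip_est(seq, end_clip=20, min_tail=6):
    """Remove a trailing poly-A run, then clip end_clip bases from each end."""
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)
    return seq[end_clip:len(seq) - end_clip]


est = "C" * 20 + "ACGTACGT" + "G" * 20 + "A" * 10
print(clip_est(est))
```

Reads shorter than twice the clip length come back empty and can be dropped before alignment.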

32
Human ESTGenes
[Figure: genomic length distribution of aligned human ESTs, annotated at 400 bp, with a tail up to 800 kb]
33
The Problem
ESTs
Genome
What are the transcripts represented in this set
of mapped ESTs?
34
Predict Transcripts from ESTs
ESTs
Transcript predictions
Merge ESTs according to splicing structure
compatibility
35
Representation
Every 2 ESTs in a Genomic Cluster may represent
the same splicing (redundant) or not.
The redundancy relation is a graph.
[Figure: the two redundancy relations between ESTs: Extension (ESTs x and y) and Inclusion (ESTs x and z)]
Sort by the smallest coordinate ascending and by
the largest coordinate descending.
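The two relations and the sort order can be sketched in a few lines. This is a simplified illustration, not the ClusterMerge implementation: ESTs are lists of (start, end) exons, two ESTs are taken as redundant only if they show exactly the same introns over their common span (no mismatches allowed), and all names are hypothetical.

```python
def introns(exons):
    return [(a[1], b[0]) for a, b in zip(exons, exons[1:])]


def sort_key(est):
    """Smallest coordinate ascending, largest coordinate descending."""
    return (est[0][0], -est[-1][1])


def relation(x, y):
    """Relation of y to x, where x sorts before y."""
    lo = max(x[0][0], y[0][0])
    hi = min(x[-1][1], y[-1][1])
    if lo >= hi:
        return "no overlap"

    def inside(exons):
        # Introns that overlap the span shared by both ESTs.
        return [i for i in introns(exons) if i[0] < hi and i[1] > lo]

    if inside(x) != inside(y):
        return "clash"
    return "inclusion" if y[-1][1] <= x[-1][1] else "extension"


x = [(0, 100), (200, 300), (400, 500)]
print(relation(x, [(250, 300), (400, 450)]))  # y ends inside x
print(relation(x, [(250, 300), (400, 600)]))  # y runs past the end of x
```

With this sort order, an EST can only be included in, or extend, ESTs that come before it in the list.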
36
Criteria of merging
Allow edge-exon mismatches
Allow internal mismatches
Allow intron mismatches
37
Transitivity
[Figure: transitivity of the Extension and Inclusion relations, shown with ESTs x, y, z and w]
This reduces the number of comparisons needed
38
ClusterMerge graph
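Each node defines an inclusion sub-tree
[Figure: ESTs x, y, z and the inclusion sub-tree they form]
Extensions form acyclic graphs
[Figure: ESTs x, y, z, w and the acyclic extension graph they form]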
39
Recovering the Solution
Mergeable sets of ESTs can be recovered
as special paths in the graph
[Figure: example graph with nodes 1-9]
40
Recovering the Solution
Root: does not extend any node
Leaf: not extended, and root of an inclusion tree
[Figure: example graph with nodes 1-9, with the root and the leaves marked]
41
Recovering the Solution
Any set of ESTs in a path from a root to a leaf
is mergeable
[Figure: example graph with nodes 1-9, with the root and the leaves marked]
42
Recovering the Solution
Add the inclusion tree attached to each node in
the path
[Figure: example graph with nodes 1-9, with the root and the leaves marked]
43
Recovering the Solution
Lists produced: (1,2,3,4,5,6,7,8), (1,2,3,4,5,6,7,9)
[Figure: example graph with nodes 1-9]
This representation minimizes the necessary
comparisons between ESTs
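The recovery step can be sketched as a walk from each root to each leaf along extension edges, collecting the inclusion tree of every node on the way. The edges of the example graph are not recoverable from this transcript, so the graph below is hypothetical, chosen only so that it yields the two lists above. The function names are likewise assumptions.

```python
def inclusion_tree(node, included):
    """The node plus everything included in it, recursively."""
    members = [node]
    for child in included.get(node, []):
        members += inclusion_tree(child, included)
    return members


def mergeable_lists(extended_by, included):
    """One list per root-to-leaf path in the extension graph."""
    extending = {n for targets in extended_by.values() for n in targets}
    roots = sorted((set(extended_by) | extending) - extending)
    lists = []

    def walk(node, path):
        path = path + [node]
        followers = extended_by.get(node, [])
        if not followers:  # leaf: not extended by any node
            members = []
            for n in path:
                members += inclusion_tree(n, included)
            lists.append(tuple(sorted(members)))
        for nxt in followers:
            walk(nxt, path)

    for root in roots:
        walk(root, [])
    return lists


# Hypothetical edges: extended_by[a] lists the nodes that extend a.
extended_by = {1: [2], 2: [5], 5: [7], 7: [8, 9]}
included = {1: [3, 4], 5: [6]}
print(mergeable_lists(extended_by, included))
```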
44
How to build the graph
Mutual Recursion
Inclusion → go up in the tree
Recursion: search along extension branch
Search graph (leaves)
Search sub-graph
45
How to build the graph
Example
[Figure: ESTs 1-6]
46
How to build the graph
Example
[Figure: ESTs 1-6 and the graph built from them]
47
How to build the graph
Example
[Figure: graph of ESTs 1-6 and new EST 7. Step shown: Leaves]
48
How to build the graph
Example
[Figure: graph of ESTs 1-6 and new EST 7. Step shown: Inclusion]
49
How to build the graph
Example
[Figure: graph of ESTs 1-6 and new EST 7. Step shown: Inclusion]
50
How to build the graph
Example
[Figure: graph of ESTs 1-6 and new EST 7. Step shown: Extension]
51
How to build the graph
Example
[Figure: graph of ESTs 1-6 and new EST 7. Step shown: Inclusion]
52
How to build the graph
Example
[Figure: graph of ESTs 1-7. Step shown: Place]
53
How to build the graph
Example
[Figure: graph of ESTs 1-7. Step shown: Inclusion]
54
How to build the graph
Example
[Figure: graph of ESTs 1-7. Step shown: tagged as visited - skip]
55
How to build the graph
Example
[Figure: graph of ESTs 1-7]
Possible sub-trees beyond 1 or 3 remain unseen!
The representation minimizes the necessary comparisons
56
Deriving the transcripts from the lists
Internal splice sites: external coordinates of
the 5' and 3' exons are not allowed to
contribute
57
Deriving the transcripts from the lists
Splice sites are set to the most common
coordinate.
5' and 3' coordinates are set to the exon
coordinate that extends the potential UTR the most.
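The two rules can be sketched for a deliberately simplified case. The sketch assumes every EST in the list spans the same number of exons and is given in forward-strand coordinates, so no external exon end ever competes with an internal splice site. The function names are hypothetical.

```python
from collections import Counter


def most_common(values):
    return Counter(values).most_common(1)[0][0]


def consensus_transcript(ests):
    """ests: list of exon lists [(start, end), ...] of equal length."""
    n = len(ests[0])
    exons = []
    for i in range(n):
        starts = [est[i][0] for est in ests]
        ends = [est[i][1] for est in ests]
        # Outer ends: the coordinate reaching furthest out.
        # Internal splice sites: the most common coordinate.
        start = min(starts) if i == 0 else most_common(starts)
        end = max(ends) if i == n - 1 else most_common(ends)
        exons.append((start, end))
    return exons


ests = [
    [(100, 200), (300, 400)],
    [(150, 200), (300, 450)],
    [(120, 201), (300, 380)],
]
print(consensus_transcript(ests))
```

In the real lists ESTs cover different exons, so the vote at each splice site would be taken only among the ESTs for which that site is internal.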
58
Single exon transcripts
Reject resulting single exon transcripts when
using ESTs
59
Annotation with ESTs
ESTs aligned to the genome can provide
information about UTRs and alternative splicing
60
Annotation with ESTs
EST-Transcripts at www.ensembl.org
61
Annotation with ESTs
62
Results for Human and Mouse
  • Human EST-genes (assembly ncbi33)
  • 38,581 Genes
  • 122,247 Transcripts (42% with full CDS)
  • Mouse EST-genes (assembly ncbi30)
  • 32,848 Genes
  • 103,664 Transcripts (36% with full CDS)

63
  • How many transcripts are conserved?
  • Is Alternative Splicing conserved?

64
EST-transcript pairs
  • 42,625 transcript pairs (in 18,242 gene pairs)

Gene pairs:
  • 78% with one transcript pair conserved
  • 22% with more than one transcript pair conserved
For 22% of the gene pairs some form of alt.
splicing is conserved
65
Conservation of Alt. Splicing
  • Take gene-pairs with more than one
    transcript-pair

$$
\text{conservation} = \frac{\sum (\text{number of paired transcripts} - 1)}{\sum (\text{number of transcripts} - 1)}
$$
Σ: sum over genes in a gene pair with more than one variant (subtract the
main transcript form)
  • 19% of alt. variants in human are conserved in mouse
  • 32% of alt. variants in mouse are conserved in human
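The ratio can be worked through on a toy input. The sketch assumes each gene is summarised as a pair (number of transcripts, number of paired transcripts), that only genes with more than one variant are counted, and that each counted gene has at least one paired transcript. The numbers are invented for illustration and are not the data behind the percentages above.

```python
def conservation(genes):
    """genes: list of (n_transcripts, n_paired) for one species."""
    multi = [(n, p) for n, p in genes if n > 1]
    paired = sum(p - 1 for _, p in multi)
    total = sum(n - 1 for n, _ in multi)
    return paired / total


# Three hypothetical genes: (transcripts, paired transcripts)
genes = [(3, 2), (4, 1), (2, 2)]
print(conservation(genes))  # (1 + 0 + 1) / (2 + 3 + 1)
```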
66
  • How many predicted novel genes
  • are validated by Human-Mouse comparison?

67
Novel genes
[Figure: three sets of human ESTGenes: not in Ensembl; validated by comparison to mouse; with at least one complete ORF. Values shown: 13,174; 18,242; 24,201]
68
Novel genes
ESTGenes not in Ensembl, validated by comparison
to mouse, with a complete ORF: 984
69
  • THE END