Title: Using Expressed Sequence Tags to Improve Gene Structure Prediction
1. Using Expressed Sequence Tags to Improve Gene Structure Prediction
- Chaochun Wei
- Advisor: Michael Brent
- Dissertation Defense: 4/17/2006
2. The Challenge and Opportunity
- More than 1000 genomes
- 88 animals
- 29 plants
- Gene structure annotation
- Better gene structure annotation
- Protein-coding-gene information is used in many areas, such as protein structure and function studies, disease gene finding, basic biology, and agriculture.
3. Introduction: DNA, mRNA, cDNA and EST
[Figure: DNA (5' to 3', a coding region flanked by UTRs) is transcribed into primary mRNA; RNA processing yields mRNA; reverse transcription yields cDNA, either full-length or 5'-truncated.]
- EST
- Short (about 650 bases)
- High error rate (1-5%)
- May contain only UTRs or only coding regions
4. Gene Structure Prediction Methods
- Categorizations
- Ab initio methods
- Only need genomic sequences as input
- GENSCAN (Burge 1997; Burge and Karlin 1997)
- Can predict novel genes
- Transcript-alignment-based methods
- Use cDNA, mRNA, or protein similarity as major clues
- ENSEMBL (Birney, Clamp, et al. 2004)
- Highly accurate
- Can only find genes with transcript evidence
- cDNA coverage: 50-60%
- EST coverage: up to 85%
5. Gene Prediction Methods (continued)
- Hybrid methods
- Integrate cDNA, mRNA, protein, and EST alignments into ab initio methods
- Genie (Kulp, Haussler et al. 1996)
- Fgenesh (Solovyev and Salamov 1997)
- GenomeScan (Yeh, Lim et al. 2001)
- GAZE (Howe, Chothia et al. 2002)
- AUGUSTUS (Stanke, Schöffmann et al. 2006)
6. Comparative-Genomics-Based Methods: TWINSCAN and N-SCAN
- De novo gene prediction programs
- Assumption: coding regions are more conserved than noncoding regions during evolution, under the pressure of natural selection.
- No transcript similarity information (such as EST, cDNA, mRNA, or protein) is used
7. Sequence Representation of EST Alignments
- Use EST-to-genome alignment programs
- BLAT (Kent 2002)
- Project the top alignment for each EST to the
target genomic sequence
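The projection step can be sketched as follows. This is a minimal illustration and not the TWINSCAN_EST implementation: the three-symbol alphabet ('E' for an aligned block, 'I' for a gap between aligned blocks of the same EST, 'N' for no evidence), the rule that exon evidence overrides intron evidence, and the function name `project_est_alignments` are all assumptions made for this example. The input is assumed to be block coordinates already parsed from the top alignment of each EST.

```python
def project_est_alignments(genome_len, alignments):
    """Label every genomic position using the top alignment of each EST.

    alignments: one list per EST of (start, end) aligned blocks,
    0-based, end-exclusive, sorted by start.
    Assumed symbols: 'E' aligned block, 'I' gap between blocks, 'N' no evidence.
    """
    labels = ["N"] * genome_len

    # First pass: gaps between consecutive blocks of one EST are putative introns.
    for blocks in alignments:
        for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
            for pos in range(prev_end, next_start):
                labels[pos] = "I"

    # Second pass: aligned blocks override intron labels (an assumed tie-break).
    for blocks in alignments:
        for start, end in blocks:
            for pos in range(start, end):
                labels[pos] = "E"

    return "".join(labels)


# One EST aligned in two blocks to a 12-base target sequence.
print(project_est_alignments(12, [[(2, 4), (7, 9)]]))  # NNEEIIIEENNN
```

The resulting string has the same length as the genomic sequence, so a gene predictor can read it position by position alongside the DNA.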
8. Accuracy Measurement
[Figure labels: Annotation; Prediction; Correct Prediction]
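The three quantities on this slide are commonly combined into sensitivity (correct predictions divided by annotations) and specificity (correct predictions divided by predictions). The sketch below assumes that convention and assumes that a prediction counts as correct only when its coordinates match an annotated feature exactly; the function name and the coordinate tuples are illustrative.

```python
def accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for exact-match features.

    annotated, predicted: iterables of hashable features,
    for example (start, end) exon coordinates.
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted  # predictions that match an annotation exactly
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (900, 950)]
sn, sp = accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

The same function applies at the exon or gene level, depending on what each feature represents.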
9. TWINSCAN_EST and N-SCAN_EST on the Whole Human Genome
10. N-SCAN_EST vs. AUGUSTUS
- On human chromosome 22
- Use the same EST alignments
11. An Example of N-SCAN_EST Prediction
(hg17, chr21:33,459,500-33,465,411)
12. An Example of N-SCAN_EST Prediction
13. Part 1: Conclusion
- A novel approach to using ESTs for gene prediction
- Simple
- Effective
- Improves predictions in both coding and non-coding regions
- Significantly better accuracy
- Trainable
- ESTs can significantly improve gene prediction
accuracy.
14. Part 2: A New EST-to-genome Alignment Algorithm
- Motivation
- A better EST-to-genome alignment program to improve gene prediction
15. A Correct Alignment Can Be Other Than a Match
- SNPs (one in every 100-300 bases)
- Dominate in high quality regions
- EST sequencing errors (estimated from quality values)
- Dominate in low quality regions
- Other
Quality value: Q = -10 log10(P), where P is the estimated error probability for the base
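Assuming the EST quality values follow the Phred convention of Ewing and Green (1998), in which Q = -10 log10(P), the conversion between a quality value and an error probability can be sketched as below. The function names are illustrative.

```python
import math


def error_prob(quality):
    """Estimated probability that a base call with this quality value is wrong."""
    return 10 ** (-quality / 10)


def quality_value(p_error):
    """Phred-style quality value for an estimated error probability."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 40):
    print(q, error_prob(q))
```

A quality value of 20 corresponds to an estimated error rate of about 1 in 100, and 30 to about 1 in 1,000, which is why the 1-in-100 to 1-in-300 SNP rate matters most where quality is high.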
16. A Graphical Model for Error Patterns in Correct Alignments
[Figure: graphical model with an SNP component, a sequencing error model, and a null model.]
Legend: RG = genomic sequence; EG = EST/cDNA sequence; EC = EST base call; qual = quality value
17. Graphical Model for Error Patterns in Correct Alignments
- From sequencing error distribution (Ewing and Green 1998)
- From dbSNP human data: Pr(RGEG)
- From human genome: Pr(RG)
18. Alignment Scores from the Graphical Model for Error Patterns in Correct Alignments
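One way such a score could be derived is sketched below. It is an illustration under stated assumptions, not the scoring used in the dissertation: the genomic base RG gives rise to the EST's true base EG through a SNP model, EG gives rise to the base call EC through a sequencing error model driven by the quality value, and the score is the log-odds against a null model. The uniform SNP substitution, the uniform miscall distribution, the SNP rate of 1 in 200 (chosen from the 100-300 range quoted earlier), the uniform background of 0.25, and all function names are assumptions; the slides indicate that the real distributions were estimated from dbSNP, the human genome, and sequencing error data.

```python
import math

BASES = "ACGT"


def emission_prob(genome_base, called_base, quality, snp_rate=1 / 200):
    """Pr(EC | RG, qual), summing over the unobserved true EST base EG."""
    p_err = 10 ** (-quality / 10)  # Phred-style error probability (assumed)
    total = 0.0
    for true_base in BASES:
        # SNP model: EG equals RG unless a SNP occurred (uniform over other bases).
        p_snp = 1 - snp_rate if true_base == genome_base else snp_rate / 3
        # Sequencing error model: EC equals EG unless miscalled (uniform).
        p_call = 1 - p_err if called_base == true_base else p_err / 3
        total += p_snp * p_call
    return total


def log_odds_score(genome_base, called_base, quality, snp_rate=1 / 200, background=0.25):
    """Log2 odds of the aligned pair under the model versus a uniform null model."""
    return math.log2(emission_prob(genome_base, called_base, quality, snp_rate) / background)


for q in (5, 20, 40):
    print(q, round(log_odds_score("A", "A", q), 2), round(log_odds_score("A", "C", q), 2))
```

Under these assumptions a mismatch at a low quality value is penalized much less than one at a high quality value, and at high quality the SNP term accounts for most of the mismatch probability.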
19. PairHMM Using Quality Value Sequence
[Figure: pair HMM state diagram with Begin and End states]
20. Stepping Stone Algorithm and Intron-cutout Algorithm
- Use BLAST HSPs as a heuristic
- Restricted alignment
[Figure labels: Genomic Sequence; EST Sequence]
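The general idea of restricting an alignment with HSPs can be sketched as follows. This is one plausible reading of the slide and not the algorithm as implemented: it assumes the HSPs are already collinear, pins the alignment to the midpoint of each HSP (a hypothetical anchor choice), and reports the rectangles of the dynamic programming matrix that would still need to be filled. The function names and the HSP tuple layout are assumptions.

```python
def restricted_regions(est_len, genome_len, hsps):
    """Rectangles of the DP matrix left to fill once anchors are pinned.

    hsps: list of (est_start, est_end, gen_start, gen_end), end-exclusive,
    sorted and collinear (increasing in both sequences).
    Returns a list of ((est_lo, gen_lo), (est_hi, gen_hi)) corner pairs.
    """
    anchors = [(0, 0)]
    for est_start, est_end, gen_start, gen_end in hsps:
        # Assumed anchor: the alignment passes through the middle of each HSP.
        anchors.append(((est_start + est_end) // 2, (gen_start + gen_end) // 2))
    anchors.append((est_len, genome_len))
    return list(zip(anchors, anchors[1:]))


def cell_count(regions):
    """Number of DP cells covered by the restricted rectangles."""
    return sum((e_hi - e_lo) * (g_hi - g_lo) for (e_lo, g_lo), (e_hi, g_hi) in regions)


hsps = [(0, 200, 1000, 1200), (200, 400, 5000, 5200), (400, 600, 9000, 9200)]
regions = restricted_regions(600, 10000, hsps)
print(cell_count(regions), "of", 600 * 10000, "cells")
```

In this toy case the restricted search fills 1,800,000 cells instead of 6,000,000; removing the interior of long gaps between HSPs on the genomic side would shrink the rectangles further.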
21. Result: Mismatches vs. Quality Values
On 3,053 non-overlapping ESTs on chr20, 21, and 22
22. Results: Mismatches Explained by SNPs or Sequencing Errors
On 3,053 non-overlapping ESTs on chr20, 21, and 22
Compared to 38,509 SNPs in coding regions or UTRs
Compared aligned mismatches
23. Effect of Different Alignment Programs on Gene Prediction
TWINSCAN_EST on human chromosomes 20, 21, and 22
24. Posterior Probability for an Alignment
[Figure: plot with axes labeled Posterior Probability and Quality Value]
25. Part 2: Conclusions
- A graphical model provides a framework for error patterns in correct alignments
- QPAIRAGON
- Improves the alignment by reducing errors in the high quality regions
- Improves the accuracy of gene prediction systems that use EST information
- Provides posterior probability as a reliability measurement of an alignment
26. Future Work
- Speed up QPAIRAGON
- Extend to
- SNP finding
- UTR prediction
- Alternative splicing detection
27. Acknowledgements
- Advisor: Michael Brent
- Lab members
- Jeltje van Baren
- Mani Arumugam
- Randy Brown
- Aaron Tenney
- Evan Keibler
- Robert Zimmermann