Using Expressed Sequence Tags to Improve Gene Structure Prediction

1
Using Expressed Sequence Tags to Improve Gene
Structure Prediction

Chaochun Wei
Advisor Michael Brent
Dissertation Defense 4/17/2006

2
The Challenge and Opportunity

More than 1000 genomes
88 animals
29 plants

Gene Structure Annotation
Better Gene Structure Annotation
Protein-coding gene information is used in many areas, such as protein structure and function studies, disease gene finding, basic biology, and agriculture.
3
Introduction: DNA, mRNA, cDNA and EST
[Figure: DNA (5' UTR, coding region, 3' UTR) is transcribed into primary mRNA; RNA processing yields mRNA; reverse transcription yields cDNA, either full-length or 5'-truncated.]

EST
Short (about 650 bases)
High error rate (1-5%)
Contains only UTRs or coding regions

4
Gene Structure Prediction Methods

Categorizations
Ab initio methods
Only need genomic sequences as input
GENSCAN (Burge 1997; Burge and Karlin 1997)
Can predict novel genes
Transcript-alignment-based methods
Use cDNA, mRNA or protein similarity as major clues
ENSEMBL (Birney, Clamp et al. 2004)
Highly accurate
Can only find genes with transcript evidence
cDNA coverage 50-60%
EST coverage up to 85%

5
Gene Prediction Methods (continued)

Hybrid Methods
Integrate cDNA, mRNA, protein and EST alignments into ab initio methods
Genie (Kulp, Haussler et al. 1996)
Fgenesh (Solovyev and Salamov 1997)
GenomeScan (Yeh, Lim et al. 2001)
GAZE (Howe, Chothia et al. 2002)
AUGUSTUS (Stanke, Schöffmann et al. 2006)

6
Comparative-Genomics-Based Methods: TWINSCAN and N-SCAN

De novo gene prediction programs
Assumption: coding regions are more conserved than noncoding regions during evolution because of natural selection pressure.
No transcript similarity information (such as EST, cDNA, mRNA, or protein) is used

7
Sequence Representation of EST Alignments

Use EST-to-genome alignment programs
BLAT (Kent 2002)
Project the top alignment for each EST onto the target genomic sequence
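
The projection step can be illustrated with a minimal sketch. It assumes each EST's top alignment is already available as a list of genomic blocks (for example, parsed from BLAT output), and it uses three hypothetical symbols that are not taken from the slides: 'E' for a position covered by an aligned block, 'I' for a position in the gap between two blocks of the same EST, and 'N' for a position with no EST alignment. Letting 'E' override 'I' where ESTs disagree is also an assumption of this sketch.

```python
from typing import List, Tuple

# A block is a half-open genomic interval [start, end) covered by one
# aligned segment of an EST. These names are illustrative only.
Block = Tuple[int, int]


def project_est_alignments(genome_len: int,
                           alignments: List[List[Block]]) -> str:
    """Project one top alignment per EST onto the genome as a symbol string."""
    seq = ["N"] * genome_len

    # First pass: mark the gaps between consecutive blocks of each EST.
    for blocks in alignments:
        ordered = sorted(blocks)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
            for pos in range(prev_end, next_start):
                seq[pos] = "I"

    # Second pass: aligned blocks take precedence over gaps (assumption).
    for blocks in alignments:
        for start, end in blocks:
            for pos in range(start, end):
                seq[pos] = "E"

    return "".join(seq)


if __name__ == "__main__":
    # One EST with two aligned blocks on a 12-base toy genome.
    print(project_est_alignments(12, [[(2, 4), (7, 9)]]))
```

The resulting string has the same length as the genomic sequence, so a gene predictor can read it position by position alongside the DNA.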

8
Accuracy Measurement
Annotation
Prediction
Correct Prediction
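
The slide names only the three sets. Assuming it follows the usual convention in gene prediction, where sensitivity is the fraction of annotated features that are predicted correctly and specificity is the fraction of predicted features that are correct, the two measures can be sketched as follows. Representing an exon as a (start, end) tuple and requiring an exact match are choices made for this example.

```python
from typing import Set, Tuple

Feature = Tuple[int, int]  # e.g. an exon as (start, end); illustrative only


def accuracy(annotated: Set[Feature],
             predicted: Set[Feature]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for exact-match features."""
    correct = annotated & predicted
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    annotated = {(100, 200), (300, 400), (500, 600), (700, 800)}
    predicted = {(100, 200), (300, 400), (510, 600)}
    sn, sp = accuracy(annotated, predicted)
    print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

The same function applies at the gene or transcript level if each feature is replaced by a hashable description of the whole structure.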
9
TWINSCAN_EST and N-SCAN_EST on the Whole Human
Genome
10
N-SCAN_EST vs. AUGUSTUS

On human chromosome 22
Use the same EST alignments

11
An Example of N-SCAN_EST Prediction
(Hg17, chr21:33,459,500-33,465,411)
12
An Example of N-SCAN_EST Prediction
13
Part 1: Conclusion

A novel approach to using ESTs for gene prediction
Simple
Effective
Improves predictions in both coding and non-coding regions
Significantly better accuracy
Trainable
ESTs can significantly improve gene prediction accuracy.

14
Part 2: A New EST-to-genome Alignment Algorithm

Motivation
A better EST-to-genome alignment program to
improve gene prediction

15
A Correct Alignment Can Be Other Than a Match

SNPs (one in every 100-300 bases)
Dominate in high quality regions
EST sequencing quality values
Dominate in low quality regions
Other

q = -10 log10(P), where q is the quality value and P is the estimated error probability for the base
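
As a small worked example, Phred-style quality values (Ewing and Green 1998) and the error probability P are related by q = -10 log10(P). The function names below are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Quality value corresponding to error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

A quality value of 20 therefore corresponds to an estimated error of one base in 100, and a value of 30 to one base in 1,000.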
16
A Graphical Model for Error Patterns in Correct
Alignments
[Figure: graphical model with an SNP component and a sequencing error component, shown with a null model.]
RG: genomic sequence; EG: EST/cDNA sequence; EC: EST base call; qual: quality value
17
Graphical Model for Error Patterns in Correct
Alignments

From sequencing error distribution (Ewing and Green 1998)
From dbSNP human data: Pr(RGEG)
From human genome: Pr(RG)

18
Alignment Scores from the Graphical Model for
Error Patterns in Correct Alignments
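
The scoring formulas themselves did not survive in this transcript. The following is only a rough sketch of how a quality-aware log-odds score could be derived from a model of this kind, not the scores used in the talk. It assumes a single SNP rate (1/300, within the "one in every 100-300 bases" range quoted earlier), substitutions spread evenly over the three other bases, a Phred-style error probability, and a uniform null model with probability 0.25 per base. All names and constants are hypothetical.

```python
import math

SNP_RATE = 1 / 300       # assumed; the slides quote one SNP per 100-300 bases
NULL_BASE_PROB = 0.25    # assumed uniform null model


def emission_probs(quality: float, snp_rate: float = SNP_RATE):
    """Return (P(match), P(one specific mismatch)) for a base call.

    A called base can differ from the genome because the true EST base is
    an SNP, because of a sequencing error, or both.
    """
    e = 10 ** (-quality / 10)   # sequencing error probability
    s = snp_rate
    p_match = (1 - s) * (1 - e) + s * e / 3
    p_mismatch = (1 - s) * e / 3 + s * (1 - e) / 3 + 2 * s * e / 9
    return p_match, p_mismatch


def log_odds_scores(quality: float, snp_rate: float = SNP_RATE):
    """Return (match score, mismatch score) in bits against the null model."""
    p_match, p_mismatch = emission_probs(quality, snp_rate)
    return (math.log2(p_match / NULL_BASE_PROB),
            math.log2(p_mismatch / NULL_BASE_PROB))


if __name__ == "__main__":
    for q in (5, 20, 40):
        print(q, log_odds_scores(q))
```

Under these assumptions a mismatch at a low-quality base is penalized much less than a mismatch at a high-quality base, where only an SNP can plausibly explain it.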
19
PairHMM using Quality Value Sequence
[Figure: pair HMM state diagram with Begin and End states.]
20
Stepping Stone Algorithm and Intron-cutout
Algorithm

Using BLAST HSPs as heuristic
Restricted alignment

Genomic Sequence
EST Sequence
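
The slide gives only two cues: BLAST HSPs used as a heuristic, and a restricted alignment. One common way to combine them, shown here as an illustrative sketch and not as the Stepping Stone or Intron-cutout algorithm itself, is to pick the best colinear chain of HSPs as anchors and then run dynamic programming only in the rectangles between consecutive anchors. The `HSP` fields and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class HSP(NamedTuple):
    est_start: int
    est_end: int     # half-open
    gen_start: int
    gen_end: int     # half-open
    score: float


def chain_hsps(hsps: List[HSP]) -> List[HSP]:
    """Highest-scoring chain of colinear, non-overlapping HSPs (O(n^2))."""
    order = sorted(hsps, key=lambda h: (h.est_start, h.gen_start))
    if not order:
        return []
    best = [h.score for h in order]
    prev = [-1] * len(order)
    for i, h in enumerate(order):
        for j in range(i):
            g = order[j]
            # g may precede h only if it ends before h starts in both sequences.
            if g.est_end <= h.est_start and g.gen_end <= h.gen_start:
                if best[j] + h.score > best[i]:
                    best[i] = best[j] + h.score
                    prev[i] = j
    i = max(range(len(order)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(order[i])
        i = prev[i]
    return chain[::-1]


def restricted_regions(chain: List[HSP]) -> List[Tuple[int, int, int, int]]:
    """Rectangles (est_from, est_to, gen_from, gen_to) between anchors."""
    return [(a.est_end, b.est_start, a.gen_end, b.gen_start)
            for a, b in zip(chain, chain[1:])]


if __name__ == "__main__":
    hsps = [HSP(0, 50, 1000, 1050, 50), HSP(50, 120, 5000, 5070, 70),
            HSP(40, 60, 3000, 3020, 20), HSP(120, 150, 4000, 4030, 30)]
    chain = chain_hsps(hsps)
    print(chain)
    print(restricted_regions(chain))
```

In the toy input the two chained anchors are adjacent on the EST but 3,950 bases apart on the genome, which is the pattern an intron would produce.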
21
Result: Mismatches vs. Quality Values
On 3,053 non-overlapping ESTs on chr20, 21 and 22
22
Results: Mismatches Explained by SNPs or Sequencing Errors
On 3,053 non-overlapping ESTs on chr20, 21 and 22
Compared to 38,509 SNPs in coding regions or UTRs
Compared aligned mismatches
23
Effect of Different Alignment Programs on Gene
Prediction
TWINSCAN_EST on human chromosomes 20, 21, and 22
24
Posterior Probability for an Alignment
Posterior Probability
Quality Value
25
Part 2: Conclusions