Using Expressed Sequence Tags to Improve Gene Structure Prediction

1
Using Expressed Sequence Tags to Improve Gene
Structure Prediction
  • Chaochun Wei
  • Advisor: Michael Brent
  • Dissertation Defense, 4/17/2006

2
The Challenge and Opportunity
  • More than 1000 genomes
  • 88 animals
  • 29 plants

Gene Structure Annotation
Better Gene Structure Annotation
Using protein-coding-gene information in many
areas, such as protein structure and function
studies, disease gene finding, basic biology, and
agriculture.
3
Introduction: DNA, mRNA, cDNA and EST
[Figure: DNA (coding region flanked by 5' and 3' UTRs) is transcribed to primary mRNA, RNA processing yields mRNA, and reverse transcription yields cDNA, either full-length or 5'-truncated]
  • EST
  • Short (about 650 bases)
  • High error rate (1-5%)
  • Contains only UTRs or coding regions

4
Gene Structure Prediction Methods
  • Categorizations
  • Ab initio methods
  • Only need genomic sequences as input
  • GENSCAN (Burge 1997; Burge and Karlin 1997)
  • Can predict novel genes
  • Transcript-alignment-based methods
  • Use cDNA, mRNA or protein similarity as major
    clues
  • ENSEMBL (Birney, Clamp, et al. 2004)
  • Highly accurate
  • Can only find genes with transcript evidence
  • cDNA coverage 50-60%
  • EST coverage up to 85%

5
Gene Prediction Methods (continued)
  • Hybrid Methods
  • Integrate cDNA, mRNA, protein and EST alignments
    into ab initio methods
  • Genie (Kulp, Haussler et al. 1996)
  • Fgenesh (Solovyev and Salamov 1997)
  • Genomescan (Yeh, Lim et al. 2001)
  • GAZE (Howe, Chothia et al. 2002)
  • AUGUSTUS (Stanke, Schöffmann et al. 2006)

6
Comparative-Genomics-Based Methods: TWINSCAN and
N-SCAN
  • De novo gene prediction programs
  • Assumption: coding regions are more conserved
    than noncoding regions during evolution because
    of natural selection pressure.
  • No transcript similarity information (like EST,
    cDNA, mRNA, or protein) is used

7
Sequence Representation of EST Alignments
  • Use EST-to-genome alignment programs
  • BLAT (Kent 2002)
  • Project the top alignment for each EST to the
    target genomic sequence
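
One way to picture the projection step is as a per-base evidence track laid over the genomic sequence. The sketch below is a hypothetical illustration, not the representation used in the talk: the three symbols ('E' for bases inside an aligned block, 'I' for bases in a gap between blocks of the same EST, 'N' for no evidence), the function name, and the rule that 'E' takes precedence over 'I' are all assumptions made for the example.

```python
def project_est_alignments(genome_length, alignments):
    """Project EST alignments onto a genomic sequence as a symbol track.

    alignments: one entry per EST (its top alignment only), given as a list
    of (start, end) aligned blocks in 0-based, half-open genomic coordinates.
    Symbols (assumed for this sketch): E = aligned, I = gap between blocks
    of one EST, N = no EST evidence.
    """
    track = ["N"] * genome_length
    for blocks in alignments:
        blocks = sorted(blocks)
        # Mark gaps between consecutive blocks, without overwriting
        # positions that another EST already supports as aligned.
        for (_, end_prev), (start_next, _) in zip(blocks, blocks[1:]):
            for i in range(end_prev, start_next):
                if track[i] == "N":
                    track[i] = "I"
        # Aligned blocks take precedence over everything else.
        for start, end in blocks:
            for i in range(start, end):
                track[i] = "E"
    return "".join(track)


if __name__ == "__main__":
    # One spliced EST with two blocks, plus a second EST inside its gap.
    print(project_est_alignments(12, [[(1, 3), (6, 8)]]))
    print(project_est_alignments(12, [[(1, 3), (6, 8)], [(4, 5)]]))
```

A gene predictor could then read this track alongside the genomic bases, which is the sense in which the alignments become a "sequence".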

8
Accuracy Measurement
[Figure: overlap between the annotation set and the prediction set; the overlap is the set of correct predictions]
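
The three sets on this slide are enough to compute the two accuracy figures usually reported for gene predictors. The sketch below assumes the conventional definitions (sensitivity = correct / annotated, specificity = correct / predicted) and treats a prediction as correct only when it matches an annotated feature exactly; the function name and the sample coordinates are made up for illustration.

```python
def sensitivity_specificity(annotated, predicted):
    """Return (sensitivity, specificity) for two collections of features.

    Each feature is any hashable description, e.g. an exon as (start, end).
    A predicted feature counts as correct only if it is identical to an
    annotated one (an assumption of this sketch).
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    annotated_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted_exons = [(100, 200), (300, 400), (510, 600)]
    sn, sp = sensitivity_specificity(annotated_exons, predicted_exons)
    print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

The same function applies at the exon, transcript, or gene level; only the definition of a "feature" changes.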
9
TWINSCAN_EST and N-SCAN_EST on the Whole Human
Genome
10
N-SCAN_EST vs. AUGUSTUS
  • On human chromosome 22
  • Use the same EST alignments

11
An Example of N-SCAN_EST Prediction
(Hg17, chr21:33,459,500-33,465,411)
12
An Example of N-SCAN_EST Prediction
13
Part 1: Conclusion
  • A novel approach to using ESTs for gene prediction
  • Simple
  • Effective
  • Improves accuracy in both coding and non-coding regions
  • Significantly better accuracy
  • Trainable
  • ESTs can significantly improve gene prediction
    accuracy.

14
Part 2: A New EST-to-genome Alignment Algorithm
  • Motivation
  • A better EST-to-genome alignment program to
    improve gene prediction

15
A Correct Alignment Can Be Other Than a Match
  • SNPs (one in every 100-300 bases)
  • Dominate in high quality regions
  • EST sequencing quality values
  • Dominate in low quality regions
  • Other

Quality value q = -10 log10(P), where P is the
estimated error probability for the base
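
A quick numeric check of the quality-value scale, assuming the standard Phred definition q = -10 log10(P). The function names are chosen for this example only.

```python
import math


def error_probability(q):
    """Estimated probability that a base call with quality value q is wrong."""
    return 10 ** (-q / 10)


def quality_value(p):
    """Quality value corresponding to an estimated error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, error_probability(q))
```

Under this definition each step of 10 in quality value corresponds to a tenfold drop in the estimated error probability.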
16
A Graphical Model for Error Patterns in Correct
Alignments
[Figure: graphical model with SNP and sequencing-error components (Model), shown with a Null-Model]
Legend: RG = genomic sequence; EG = EST/cDNA sequence; EC = EST base call; qual = quality value
17
Graphical Model for Error Patterns in Correct
Alignments
  • From sequencing error distribution (Ewing and
    Green 1998)
  • From dbSNP human data: Pr(RGEG)
  • From human genome: Pr(RG)

18
Alignment Scores from the Graphical Model for
Error Patterns in Correct Alignments
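
The slide's own formula did not survive in this transcript, so the following is only a sketch of how a quality-aware log-odds score for one aligned column could be built from the pieces named earlier (RG, EG, EC, qual). Everything specific here is an assumption: a uniform SNP model with rate 1/200 (inside the 100-300 range quoted above), sequencing errors spread evenly over the three wrong bases, a uniform 0.25 null model, and all function names.

```python
import math

BASES = "ACGT"


def p_est_given_genome(rg, eg, snp_rate):
    """Pr(EG | RG): the true EST base differs from the genome only by a SNP."""
    return 1 - snp_rate if eg == rg else snp_rate / 3


def p_call_given_est(ec, eg, q):
    """Pr(EC | EG, qual): the base call differs from the true base by error."""
    p_err = 10 ** (-q / 10)
    return 1 - p_err if ec == eg else p_err / 3


def column_score(rg, ec, q, snp_rate=0.005, background=0.25):
    """Log-odds (bits) of a genome base / base call pair versus the null model."""
    # Sum over the unobserved true EST base EG.
    p_model = sum(
        p_est_given_genome(rg, eg, snp_rate) * p_call_given_est(ec, eg, q)
        for eg in BASES
    )
    return math.log2(p_model / background)


if __name__ == "__main__":
    for q in (10, 20, 40):
        print(q, column_score("A", "A", q), column_score("A", "C", q))
```

In this toy model a mismatch at a low quality value is penalized less than one at a high quality value, because a sequencing error is a likely explanation; at high quality values the mismatch score levels off at a floor set by the SNP rate.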
19
PairHMM using Quality Value Sequence
[Figure: pair HMM state diagram with Begin and End states]
20
Stepping Stone Algorithm and Intron-cutout
Algorithm
  • Using BLAST HSPs as heuristic
  • Restricted alignment

[Figure: alignment between the genomic sequence and the EST sequence]
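
The slide gives only the names of the two heuristics, so the following is one plausible reading of the intron-cutout idea rather than the algorithm presented in the talk: keep the genomic regions around the BLAST HSPs, drop the long stretches between them, align against the shortened sequence, and map coordinates back afterwards. The flank size, the function names, and the half-open coordinate convention are assumptions.

```python
def intron_cutout(genome, hsps, flank=50):
    """Shorten a genomic sequence to the neighborhoods of its HSPs.

    hsps: (start, end) genomic intervals of HSPs, 0-based and half-open.
    Returns (reduced_sequence, kept), where kept lists the genomic
    intervals that were retained, in order.
    """
    kept = []
    for start, end in sorted(hsps):
        start = max(0, start - flank)
        end = min(len(genome), end + flank)
        if kept and start <= kept[-1][1]:
            # Flanked regions touch or overlap: merge them.
            kept[-1] = (kept[-1][0], max(kept[-1][1], end))
        else:
            kept.append((start, end))
    reduced = "".join(genome[s:e] for s, e in kept)
    return reduced, kept


def to_genomic(pos, kept):
    """Map a position in the reduced sequence back to a genomic position."""
    offset = 0
    for start, end in kept:
        length = end - start
        if pos < offset + length:
            return start + (pos - offset)
        offset += length
    raise IndexError("position lies outside the reduced sequence")


if __name__ == "__main__":
    genome = "A" * 1000
    reduced, kept = intron_cutout(genome, [(100, 150), (800, 860)], flank=20)
    print(len(reduced), kept, to_genomic(90, kept))
```

The expensive dynamic-programming alignment then runs over the reduced sequence only, which is where the time and memory savings of a restricted alignment would come from.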
21
Result: Mismatches vs. Quality Values
On 3,053 non-overlapping ESTs on chr20, 21 and 22
22
Results: Mismatches Explained by SNPs or
Sequencing Errors
On 3,053 non-overlapping ESTs in chr20, 21 and 22
Compared to 38,509 SNPs in coding regions or UTRs
Compared aligned mismatches
23
Effect of Different Alignment Programs on Gene
Prediction
TWINSCAN_EST on human chromosomes 20, 21,
and 22
24
Posterior Probability for an Alignment
[Figure: axes are posterior probability and quality value]
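
To make the idea of an alignment's posterior probability concrete, the sketch below divides the weight of the single best alignment (Viterbi) by the summed weight of all alignments (forward) of two short sequences. It is a deliberately simplified stand-in, not the pair HMM from the talk: it has a single state, no affine gaps, no introns, and ignores quality values, and the parameters `p_match` and `t_gap` are invented for the example.

```python
def alignment_posterior(x, y, p_match=0.9, t_gap=0.05):
    """Posterior probability of the best global alignment of x and y."""
    t_match = 1.0 - 2.0 * t_gap
    w_gap = t_gap * 0.25  # gap step emitting one base at background 0.25

    def w_pair(a, b):
        # Aligned-pair step: joint emission of bases a and b.
        return t_match * 0.25 * (p_match if a == b else (1.0 - p_match) / 3.0)

    n, m = len(x), len(y)
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]  # sum over all paths
    vit = [[0.0] * (m + 1) for _ in range(n + 1)]  # best single path
    fwd[0][0] = vit[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            f_steps, v_steps = [], []
            if i > 0 and j > 0:
                w = w_pair(x[i - 1], y[j - 1])
                f_steps.append(fwd[i - 1][j - 1] * w)
                v_steps.append(vit[i - 1][j - 1] * w)
            if i > 0:  # base of x aligned to a gap
                f_steps.append(fwd[i - 1][j] * w_gap)
                v_steps.append(vit[i - 1][j] * w_gap)
            if j > 0:  # base of y aligned to a gap
                f_steps.append(fwd[i][j - 1] * w_gap)
                v_steps.append(vit[i][j - 1] * w_gap)
            fwd[i][j] = sum(f_steps)
            vit[i][j] = max(v_steps)
    return vit[n][m] / fwd[n][m]


if __name__ == "__main__":
    print(alignment_posterior("ACGT", "ACGT"))  # unambiguous alignment
    print(alignment_posterior("AAAA", "AAA"))   # gap placement is ambiguous
```

An unambiguous alignment gets a posterior close to 1, while an alignment whose gap could sit in several equally good places gets a low posterior, which is what makes the value usable as a reliability measure. Plain probabilities are used for readability; real sequences would need log-space arithmetic to avoid underflow.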
25
Part 2: Conclusions
  • A graphical model provides a framework for error
    patterns in correct alignments
  • QPAIRAGON
  • improves the alignment by reducing errors in the
    high quality regions
  • improves the accuracy of gene prediction systems
    that use EST information
  • provides a posterior probability as a reliability
    measure for an alignment

26
Future Work
  • Speed up QPAIRAGON
  • Extend to
  • SNP finding
  • UTR prediction
  • Alternative splicing detection

27
Acknowledgements
  • Advisor: Michael Brent
  • Lab members
  • Jeltje van Baren
  • Mani Arumugam
  • Randy Brown
  • Aaron Tenney
  • Evan Keibler
  • Robert Zimmermann