The Toy Exon Finder

Provided by: StevenS79

1
The Toy Exon Finder
2
The Toy genome
  • The genome is very dense with genes, typically no
    more than 20 bp between successive genes on a
    chromosome
  • The exons tend to be very small, on the order of
    20 bp
  • The introns in multi-exon genes tend to be very
    small, about 20 bp
  • The exons incorporate fairly strong codon biases
    and a significant GC bias
  • The splice sites, start codons, and stop codons
    are flanked by positions with fairly strong base
    composition biases
  • The codon usage and base composition statistics
    can be well characterized with some sample data
  • Genes occur only on one strand, which we will
    call the Forward strand
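
The statement that codon usage and base composition can be characterized from sample data can be illustrated with a minimal sketch. It assumes the sample exons are supplied as in-frame coding strings; the names codon_frequencies, gc_content, and pseudocount are hypothetical and do not come from the slides.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
# The 61 sense codons; stop codons are left out of the table entirely.
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3)
                if "".join(c) not in STOP_CODONS]


def codon_frequencies(sample_exons, pseudocount=1.0):
    """Estimate codon probabilities from in-frame sample coding sequences.

    Assumption: every sequence starts in frame 0. The pseudocount keeps
    unseen codons from getting probability 0.
    """
    counts = Counter()
    for seq in sample_exons:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in SENSE_CODONS:
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(SENSE_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in SENSE_CODONS}


def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0
```

Comparing gc_content over sample exons and over sample intergenic sequence gives the coding and noncoding GC levels that a GC-based exon finder would need.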

3
Toyscan 1
  • Use GC content to find exons
  • Find all ORFs such that each ORF either
      • Begins with a START and ends with a STOP
      • Begins with a START and ends with a GT
      • Begins with AG and ends with GT
      • Begins with AG and ends with STOP
  • Set threshold t such that if an ORF has GC
    content below t, it is labeled as noncoding
  • For all remaining pairs of ORFs p1, p2, do
  • If p1 and p2 overlap, then discard the ORF with
    the lower GC content
  • Output all ORFs that remain, calling them exons
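
The Toyscan 1 steps can be sketched as follows. This is one possible reading, not the presenter's implementation. Assumptions: START is ATG, STOP is TAA/TAG/TGA, exons are half-open (start, end) intervals that exclude the flanking AG and GT, candidate lengths are bounded by the hypothetical min_len and max_len parameters, and overlaps are resolved greedily from the highest GC content down.

```python
STOPS = ("TAA", "TAG", "TGA")


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0


def has_stop(seq, offset):
    """True if reading seq in frame `offset` meets a stop codon."""
    return any(seq[i:i + 3] in STOPS for i in range(offset, len(seq) - 2, 3))


def find_candidates(genome, min_len=6, max_len=60):
    """Enumerate candidate exons as half-open (start, end) intervals."""
    lefts, rights = [], []
    for i in range(len(genome) - 1):
        if genome[i:i + 3] == "ATG":
            lefts.append((i, True))         # exon begins at the start codon
        if genome[i:i + 2] == "AG":
            lefts.append((i + 2, False))    # exon begins just after the acceptor AG
        if genome[i:i + 3] in STOPS:
            rights.append((i + 3, True))    # exon ends with the stop codon
        if genome[i:i + 2] == "GT":
            rights.append((i, False))       # exon ends just before the donor GT
    found = set()
    for start, is_atg in lefts:
        for end, is_stop in rights:
            if not min_len <= end - start <= max_len:
                continue
            # Exclude the terminal stop codon before checking for internal stops.
            body = genome[start:end - 3] if is_stop else genome[start:end]
            if is_atg:
                if is_stop and (end - start) % 3 != 0:
                    continue                # START and STOP must share a frame
                is_open = not has_stop(body, 0)
            elif is_stop:
                is_open = not has_stop(body, len(body) % 3)  # frame set by the STOP
            else:
                is_open = any(not has_stop(body, f) for f in range(3))
            if is_open:
                found.add((start, end))
    return sorted(found)


def toyscan_1(genome, t, min_len=6, max_len=60):
    """Keep candidates with GC content >= t, then drop lower-GC overlaps."""
    scored = [(gc_content(genome[s:e]), s, e)
              for s, e in find_candidates(genome, min_len, max_len)]
    kept = []
    for gc, s, e in sorted((x for x in scored if x[0] >= t), reverse=True):
        if all(e <= ks or s >= ke for _, ks, ke in kept):
            kept.append((gc, s, e))
    return sorted((s, e) for _, s, e in kept)
```

Processing candidates from highest to lowest GC content guarantees a non-overlapping output; a literal pairwise pass over all overlapping pairs could give a different result depending on the order in which pairs are visited.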

4
Toyscan 2
  • Use codon bias to find exons
  • Codon frequencies for true exons are assumed to
    be known
  • Stop codons not included so they have probability
    0
  • Define codonBias function
  • For an input ORF (given), score all 3 frames
  • Ignore the fact that some frames have stop codons
    in them
  • Score = sum of log probabilities of all codons in
    that frame
  • Probabilities are taken from the known
    probabilities
  • Divide Score by number of codons n. This
    normalizes it.
  • Output the highest score of the 3 frames
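
A minimal sketch of the codonBias function described above. It assumes codon_probs is a dictionary of known codon probabilities (a hypothetical name) and treats any codon that is missing or has probability 0, such as a stop codon, as contributing a log probability of negative infinity, so a frame containing one can only win if all three frames contain one.

```python
import math


def codon_bias(orf, codon_probs):
    """Return the best per-codon average log probability over the 3 frames."""
    best = -math.inf
    for frame in range(3):
        codons = [orf[i:i + 3] for i in range(frame, len(orf) - 2, 3)]
        if not codons:
            continue                      # sequence too short for this frame
        total = 0.0
        for codon in codons:
            p = codon_probs.get(codon, 0.0)
            total += math.log(p) if p > 0 else -math.inf
        best = max(best, total / len(codons))   # normalize by codon count n
    return best
```

Dividing by the number of codons makes scores comparable between short and long ORFs, which matters here because toy exons are only about 20 bp long.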

5
Toyscan 2 (cont.)
  • Note the codonBias function achieves its maximum
    when the observed distribution within an ORF
    matches the correct distribution from real
    genes
  • Define TOYSCAN_2 as
  • A codon bias score threshold, t, is input
  • For all ORFs, score them with the codonBias
    function
  • If the score is < t, delete the ORF
  • For all remaining pairs of ORFs p1, p2, do
  • If p1, p2 overlap then discard the ORF with the
    lower codonBias score
  • Output all remaining ORFs as exons
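
The threshold and overlap steps can be sketched independently of the scoring function. In this sketch, select_exons and its parameters are hypothetical names, ORFs are half-open (start, end) intervals, and overlaps are resolved greedily in order of decreasing score, which is one interpretation of the pairwise rule above.

```python
def select_exons(orfs, score, t):
    """Drop ORFs scoring below t, then keep the best of each overlapping set.

    orfs  : iterable of half-open (start, end) intervals
    score : callable mapping an interval to a number (e.g. a codon bias score)
    t     : score threshold
    """
    survivors = [(score(o), o) for o in orfs]
    survivors = [(s, o) for s, o in survivors if s >= t]
    survivors.sort(key=lambda pair: pair[0], reverse=True)   # best score first
    kept = []
    for _, (start, end) in survivors:
        # Keep the ORF only if it overlaps nothing already kept.
        if all(end <= ks or start >= ke for ks, ke in kept):
            kept.append((start, end))
    return sorted(kept)
```

Note that an ORF that overlaps only a discarded ORF is kept by this greedy version; a strict reading of the pairwise rule might discard it as well.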

6
Toyscan 3
  • Use codon bias and weight matrix models (WMMs)
  • Input includes WMMs for start, stop, donor, and
    acceptor sites
  • Donor WMM includes 5 positions after GT
  • Acceptor WMM includes 5 positions before AG
  • Start codon WMM includes 5 positions before ATG
  • Stop codon WMM includes 5 positions after
    TAA/TAG/TGA
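
The slides take the WMMs as given input. As an illustration of what such a matrix contains, here is a minimal sketch that estimates one from aligned sample windows, for example the 5 bases following each known donor GT. The function name build_wmm, the pseudocount parameter, and the list-of-dictionaries layout are assumptions.

```python
def build_wmm(windows, pseudocount=1.0):
    """Estimate a weight matrix model from equal-length aligned windows.

    Returns one dict per position mapping each base to its probability.
    """
    if not windows:
        raise ValueError("need at least one sample window")
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")
    wmm = []
    for col in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for w in windows:
            base = w[col].upper()
            if base in counts:            # ignore ambiguity codes such as N
                counts[base] += 1
        total = sum(counts.values())
        wmm.append({base: c / total for base, c in counts.items()})
    return wmm
```

A nonzero pseudocount avoids zero probabilities, which would otherwise make the log score of any site containing an unseen base negative infinity.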

7
Toyscan 3 (cont.)
  • Score a weight matrix (scoreWMM)
  • For each position i in the sequence S, sum the
    log probabilities of the bases in the interval
    [i, j] using the WMM, where j - i + 1 is the width
    of the WMM
  • Score an ORF (scoreORF)
  • choose the matrices to use on the left and right
    ends of the ORF
  • E.g., internal exon has acceptor on left, donor
    on right
  • Score = WMM(left end) + WMM(right end) +
    codonBias
  • return Score
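
The two procedures can be sketched as below, assuming each WMM is a list of per-position dictionaries of base probabilities and that the exon is the half-open interval seq[start:end], excluding the flanking AG and GT. The window offsets follow the WMM definitions given for Toyscan 3 (5 positions before ATG or AG, 5 positions after GT or the stop codon); the argument names and the kind labels are hypothetical.

```python
import math


def score_wmm(seq, i, wmm):
    """Sum the log probabilities of seq[i : i + width] under the WMM."""
    width = len(wmm)
    if i < 0 or i + width > len(seq):
        return -math.inf                  # window runs off the sequence
    total = 0.0
    for k, column in enumerate(wmm):
        p = column.get(seq[i + k], 0.0)
        total += math.log(p) if p > 0 else -math.inf
    return total


def score_orf(seq, start, end, left_kind, right_kind, wmms, bias):
    """Score = WMM(left end) + WMM(right end) + codonBias.

    left_kind  : "start" or "acceptor"
    right_kind : "stop" or "donor"
    wmms       : dict mapping each kind to its WMM
    bias       : codon bias score of seq[start:end], computed elsewhere
    """
    left = wmms[left_kind]
    right = wmms[right_kind]
    # Left window ends just before ATG (start) or before the AG (acceptor).
    left_i = start - len(left) - (2 if left_kind == "acceptor" else 0)
    # Right window begins just after the stop codon, or after the GT (donor).
    right_i = end + (2 if right_kind == "donor" else 0)
    return score_wmm(seq, left_i, left) + score_wmm(seq, right_i, right) + bias
```

For example, an internal exon would be scored with left_kind="acceptor" and right_kind="donor", and a single-exon gene with left_kind="start" and right_kind="stop".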

8
Toyscan 3 (cont.)
  • Now define Toyscan_3 as
  • assume a scoring threshold, t, is provided
  • You will have to experiment to find a good value
    for t
  • Get all ORFs
  • Score all ORFs using the scoreORF procedure
  • If the score is < t, delete the ORF
  • For all remaining pairs of ORFs p1, p2, do
  • If p1, p2 overlap then discard the ORF with the
    lower scoreORF score
  • Output all remaining ORFs as exons
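
One way to experiment with t is to sweep candidate thresholds against a set of known exons. The sketch below is an assumption about how that experiment might look, not part of the slides: it compares thresholded candidates to the true exons by exact coordinates and skips the overlap step for brevity. The names sweep_thresholds, scored_orfs, and true_exons are hypothetical.

```python
def sweep_thresholds(scored_orfs, true_exons, thresholds):
    """Report exon-level sensitivity and precision for each threshold.

    scored_orfs : dict mapping (start, end) to a score
    true_exons  : set of (start, end) intervals known to be exons
    thresholds  : iterable of candidate values for t
    """
    results = []
    for t in thresholds:
        predicted = {orf for orf, s in scored_orfs.items() if s >= t}
        true_pos = len(predicted & true_exons)
        sensitivity = true_pos / len(true_exons) if true_exons else 0.0
        precision = true_pos / len(predicted) if predicted else 0.0
        results.append((t, sensitivity, precision))
    return results
```

A reasonable t is one where sensitivity stays high while precision has stopped improving; the right trade-off depends on the sample data.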

9
GFF format
  • coding GC 49
  • noncoding GC 50
  • 1 toy-genome initial-exon 31 46 . + . transgrp1
  • 1 toy-genome final-exon 79 98 . + . transgrp1
  • 1 toy-genome single-exon 129 140 . + . transgrp2
  • 1 toy-genome single-exon 164 193 . + . transgrp3
  • 1 toy-genome single-exon 228 260 . + . transgrp4
  • 1 toy-genome single-exon 287 304 . + . transgrp5
  • 1 toy-genome single-exon 331 354 . + . transgrp6
  • 1 toy-genome single-exon 377 400 . + . transgrp7
  • 1 toy-genome single-exon 424 435 . + . transgrp8
  • 1 toy-genome initial-exon 475 488 . + . transgrp9
  • 1 toy-genome internal-exon 512 526 . + . transgrp9
  • 1 toy-genome final-exon 545 593 . + . transgrp9
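
A minimal sketch of writing predicted exons in this layout. It assumes the nine tab-separated GFF columns (sequence name, source, feature, start, end, score, strand, frame, group), 1-based inclusive coordinates, and a "+" strand because toy genes occur only on the forward strand. The function names are hypothetical.

```python
def gff_coords(start0, end0):
    """Convert a 0-based half-open interval to 1-based inclusive coordinates."""
    return start0 + 1, end0


def exon_to_gff(seqname, feature, start, end, group,
                source="toy-genome", strand="+"):
    """Format one exon as a tab-separated GFF line.

    start and end are 1-based inclusive; score and frame are left as ".".
    """
    fields = [seqname, source, feature, str(start), str(end),
              ".", strand, ".", group]
    return "\t".join(fields)
```

For instance, an initial exon found at the 0-based interval (30, 46) would be written with start 31 and end 46.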