Bacterial Gene Finding and Glimmer also Archaeal and viral gene finding - PowerPoint PPT Presentation

1 / 51
About This Presentation
Title:

Bacterial Gene Finding and Glimmer also Archaeal and viral gene finding

Description:

DNA region gets 7 scores: 6 reading frames & non-coding?high score wins. ... Art Delcher. Steven Salzberg. Owen White (TIGR) Simon Kasif (Boston U. ... – PowerPoint PPT presentation

Number of Views:88
Avg rating:3.0/5.0
Slides: 52
Provided by: Adma3
Category:

less

Transcript and Presenter's Notes

Title: Bacterial Gene Finding and Glimmer also Archaeal and viral gene finding


1
Bacterial Gene Finding and Glimmer(also Archaeal
and viral gene finding)
  • Arthur L. Delcher and Steven SalzbergCenter for
    Bioinformatics and Computational Biology
  • University of Maryland

2
Outline
  • A (very) brief overview of microbial gene-finding
  • Codon composition methods
  • GeneMark Markov models
  • Glimmer1 2
  • Interpolated Markov Model (IMM)
  • Interpolated Context Model (ICM)
  • Glimmer3
  • Reducing false positives
  • Improving coding initiation site predictions
  • Running Glimmer3

3
Step One
  • Find open reading frames (ORFs).

TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA
Stopcodon
Stopcodon
4
Step One
  • Find open reading frames (ORFs).
  • But ORFs generally overlap

Reversestrand
Stopcodon
ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT
TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA
Stopcodon
Stopcodon
ShiftedStop
5
Campylobacter jejuni RM1221 30.3GC
All ORFs on both strands shown - color indicates
reading frame Longest ORFs likely to be
protein-coding genes Note the low GC content
6
Campylobacter jejuni RM1221 30.3GC
Purple lines are the predicted genes
Purple ORFs show annotated (true) genes
7
Campylobacter jejuni RM1221 30.3GC
Mycobacterium smegmatis MC2 67.4GC
Note what happens in a high-GC genome
8
Campylobacter jejuni RM1221 30.3GC
Mycobacterium smegmatis MC2 67.4GC
Purple lines show annotated genes
9
The Problem
  • Need to decide which orfs are genes.
  • Then figure out the coding start sites
  • Can do homology searches but that wont find
    novel genes
  • Besides, there are errors in the databases
  • Generally can assume that there are some known
    genes to use as training set.
  • Or just find the obvious ones

10
Codon Composition
  • Find patterns of nucleotides in known coding
    regions (assumed to be available).
  • Nucleotide distribution at 3 codon positions
  • Hexamers
  • GC-skew
  • (G-C)/(GC) computed in windows of size N
  • Amino-acid composition
  • Use these to decide which orfs are genes.
  • Prefer longer orfs
  • Must deal with overlaps

11
Bacterial Replication
Early replication
Theta structure
12
Termination of Replication
B. subtilis
E. coli
13
Borrelia burgdorferi (Lyme disease
pathogen)GC-skew plot
14
Codon Composition
  • Nucleotide variation at codon position

15
Codon-Composition Gene Finders
  • ZCURVE
  • Guo, Ou Zhang, NAR 31, 2003
  • Based on nucleotide and di-nucleotide frequency
    in codons
  • Uses Z-transform and Fisher linear discriminant
  • MED
  • Ouyang, Zhu, Wang She, JBCB 2(2) 2004
  • Based on amino-acid frequencies
  • Uses nearest-neighbor classification on
    entropies

16
Probabilistic Methods
  • Create models that have a probability of
    generating any given sequence.
  • Train the models using examples of the types of
    sequences to generate.
  • The score of an orf is the probability of the
    model generating it.
  • Can also use a negative model (i.e., a model of
    non-orfs) and make the score be the ratio of the
    probabilities (i.e., the odds) of the two models.
  • Use logs to avoid underflow

17
Fixed-Order Markov Models
  • k th-order Markov model bases the probability of
    an event on the preceding k events.
  • Example With a 3rd-order model the probability
    of this sequencewould be

Target
Target
18
Fixed-Order Markov Models
  • Advantages
  • Easy to train. Count frequencies of (k1)-mers
    in training data.
  • Easy to compute probability of sequence.
  • Disadvantages
  • Many (k1)-mers may be undersampled in training
    data.
  • Models data as fixed-length chunks.

Fixed-LengthContext
Target
19
GeneMark
  • Borodovsky McIninch, Comp. Chem 17, 1993.
  • Uses 5th-order Markov model.
  • Model is 3-periodic, i.e., a separate model for
    each nucleotide position in the codon.
  • DNA region gets 7 scores 6 reading frames
    non-coding?high score wins.
  • Lukashin Borodovsky, Nucl. Acids Res. 26, 1998
    is the HMM version.

20
Interpolated Markov Models (IMM)
  • Introduced in Glimmer 1.0Salzberg, Delcher,
    Kasif White, NAR 26, 1998.
  • Probability of the target position depends on a
    variable number of previous positions (sometimes
    2 bases, sometimes 3, 4, etc.)
  • How many is determined by the specific context.
  • E.g., for context ggtta the next position might
    depend on previous 3 bases tta .
  • But for context catta all 5 bases might be
    used.

ggtta
21
Real IMMs
  • Model has additional probabilities, ?, that
    determine which parts of the context to use.
  • E.g., the probability of g occurring after
    context atca is

22
Real IMMs
  • Result is a linear combination of different
    Markov orderswhere
  • Can view this as interpolating the results of
    different-order models.
  • The probability of a sequence is still the
    probability of the bases in the sequence.

23
Real IMMs
  • Problem How to determine the ?s (or
    equivalently the bjs)?
  • Traditionally done with EM algorithm using
    cross-validation (deleted estimation).
  • Slow
  • Hard to understand results
  • Overtraining can be a problem
  • We will cover EM later as part of HMMs

24
Glimmer IMM
  • Glimmer assumes
  • Longer context is always better
  • Only reason not to use it is undersampling in
    training data.
  • If sequence occurs frequently enough in training
    data, use it, i.e.,
  • Otherwise, use frequency and ?2 significance to
    set ?.
  • Interpolation is always between only 2 adjacent
    model lengths.

25
More Precisely
  • Suppose context of length k1 occurs a times,
    alt400, with target distribution D1 and context
    of length k occurs 400 times, with target
    distribution D2
  • Let p be the ?2 probability that D1 differs from
    D2
  • Then the interpolated distribution will be

This formula computes the probability of a base B
at any position in a sequence, given the context
preceding that position This formula is
repeatedly invoked for all positions in an ORF
26
IMMs vs Fixed-Order Models
  • Performance
  • IMM generally should do at least as well as a
    fixed-order model.
  • Some risk of overtraining.
  • IMM result can be stored and used like a
    fixed-order model.
  • IMM will be somewhat slower to train and will use
    more memory.

Variable-LengthContext
Target
27
Interpolated Context Model (ICM)
  • Introduced in Glimmer 2.0 Delcher, Harmon, et
    al., Nucl. Acids Res. 27, 1999.
  • Doesnt require adjacent bases in the window
    preceding the target position.
  • Choose set of positions that are most informative
    about the target position.

Variable-PositionContext
Target
28
ICM
  • For all windows compare distribution at each
    context position with target position
  • Choose position with max mutual information

Compare
29
ICM
  • Continue for windows with fixed base at chosen
    positions
  • Recurse until too few training windows
  • Result is a treedepth is of context positions
    used
  • Then same interpolation as IMM, between node and
    parent in tree

Compare
30
Sample ICM Model
ver 2.00 len 12 depth 7 periodicity 3
nodes 21845 0 ----------? 0.0260
0.225 0.240 0.358 0.177 1
---------a? 0.0769 0.243 0.165 0.501
0.092 2 ---------c? 0.0243 0.237
0.294 0.264 0.205 3 ---------g?
0.0597 0.219 0.240 0.407 0.134 4
---------t? 0.0277 0.209 0.245 0.291
0.256 5 --------aa? 0.0051 0.320
0.202 0.478 0.000 6 --------ac?
0.0046 0.195 0.118 0.541 0.146 7
--------ag? 0.0101 0.180 0.207 0.613
0.000 8 --------at? 0.0037 0.236
0.147 0.400 0.217 9 --------ca?
0.0019 0.199 0.246 0.303 0.253 10
--------cc? 0.0090 0.297 0.189 0.308
0.207 11 --------cg? 0.0035 0.180
0.406 0.267 0.148 12 --------ct?
0.0083 0.279 0.327 0.189 0.206 13
--------ga? 0.0057 0.277 0.326 0.397
0.000 14 --------gc? 0.0057 0.233
0.158 0.473 0.136 15 --------gg?
0.0056 0.119 0.252 0.417 0.213 16
--------gt? 0.0096 0.207 0.232 0.359
0.203 17 --------ta? 0.0034 0.205
0.170 0.400 0.224 18 --------tc?
0.0062 0.179 0.238 0.290 0.293 19
--------tg? 0.0052 0.183 0.338 0.323
0.156 20 --------tt? 0.0076 0.244
0.244 0.194 0.318 21 -------aaa?
0.0035 0.376 0.199 0.425 0.000 22
-------caa? 0.0037 0.313 0.197 0.490
0.000 23 -------gaa? 0.0052 0.283
0.199 0.517 0.000 24 -------taa?
0.0060 0.259 0.231 0.510 0.000 25
-------aac? 0.0031 0.176 0.138 0.519
0.166
31
ICM
  • ICMs not just for modeling genes
  • Can use to model any set of training sequences
  • Doesnt even need to be DNA
  • Have used recently to finding contaminant
    sequences in shotgun sequencing projects.
  • Used a non-periodic (i.e., homogeneous) model

32
Fixed-Length Sequences and ICMs
  • Array of ICMs?a different one for each position
    in fixed-length sequence.
  • Can model fixed regions around transcription
    start sites or splice sites.
  • Positions in region can be permuted.

33
Overlapping Orfs
  • Glimmer1 2 used rules.
  • For overlapping orfs A and B, the overlap region
    AB is scored separately
  • If AB scores higher in As reading frame, and A
    is longer than B, then reject B.
  • If AB scores higher in Bs reading frame, and B
    is longer than A, then reject A.
  • Otherwise, output both A and B with a
    suspicious tag.
  • Also try to move start site to eliminate
    overlaps.
  • Leads to high false-positive rate, especially in
    high-GC genomes.

34
Glimmer 2.0 Overlap Comments
  • OrfID Start End Frame Len Score
    Comments
  • 2 499 1689 1 L1191 r-1.151
    ShadowedBy 3
  • 3 1977 418 -1 L1560 r-1.317
    Contains 2 LowScoreBy 4 L257 S0
  • 4 1721 2611 2 L 891 r-1.156
    OlapWith 3 L257 S99
  • 5 2624 3775 2 L1152 r-1.184
  • 7 3775 4356 1 L 582 r-1.223
    DelayedBy 5 L216
  • 8 4501 6615 1 L2115 r-1.181
    OlapWith 9 L2103 S99
  • 9 6633 4513 -1 L2121 r-1.365
    LowScoreBy 8 L2103 S0
  • 10 6648 9173 3 L2526 r-1.155
  • 11 9229 10008 1 L 780 r-1.217
  • 12 10411 11208 1 L 798 r-1.192
  • 13 11215 12243 1 L1029 r-1.165
    ShorterThan 15 L865 S99
  • 15 12827 11379 -3 L1449 r-1.295
    LowScoreBy 13 L865 S0 LowScoreBy 16
  • 16 12243 13298 3 L1056 r-1.228
    OlapWith 15 L585 S99 DelayedBy 13 L
  • 17 13298 14137 2 L 840 r-1.161
  • 18 15143 14133 -3 L1011 r-1.230
  • 19 15193 15519 1 L 327 r-1.284
  • 20 15519 17249 3 L1731 r-1.175
  • 21 17249 19015 2 L1767 r-1.266
    DelayedBy 20 L1263

35
Glimmer3
  • Uses a dynamic programming algorithm to choose
    the highest-scoring set of orfs and start sites.
  • Not quite an HMM
  • Allows small overlaps between genes
  • small is user-defined
  • Scores of genes are not necessarily
    probabilities.
  • Score includes component for likelihood of start
    site

36
Reverse Scoring
  • In Glimmer3 orfs are scored from 3 end to 5
    end, i.e., from stop codon back toward start
    codon.
  • Helps find the start site.
  • The start should appear near the peak of the
    cumulative score in this direction.
  • Keeps the context part of the model entirely in
    the coding portion of gene, which it was trained
    on.

37
Reverse Scoring
38
Finding Start Sites
  • Can specify a position weight matrix (PWM) for
    the ribosome binding site.
  • Used to boost the scoreof potential start sites
    that have a match.
  • Would like to try
  • A fixed-length ICM model of start sites.
  • A decision-tree model of start sites.
  • Hard to get enough reliable data.
  • Glimmer2 had a hybridization-energy calculation
    with 16sRNA tail sequence that didnt work well.

39
Glimmer3 vs. Glimmer2
Table 3. Probability models trained on the output
of the long-orfs program. Predictions compared to
genes with annotated function. Table 4.
Probability models trained on the output of the
long-orfs program. Predictions compared to all
annotated genes.
40
Other Glimmer3 Features
  • Can specify any set of start or stop codons
  • Including by NCBI translation table
  • Can specify list of guaranteed genes
  • Or just orfs and let Glimmer choose the starts
  • Can specify genome regions to avoid
  • Guarantee no prediction will overlap these
    regions
  • Useful for RNA genes (tRNAs, rRNAs)
  • Output more meaningful orf scores.
  • Allow genes to extend beyond end of sequence
  • important for annotation of incomplete genomes

41
Glimmer3 Output
  • ----- Start ----- --- Length
    ---- -------------- Scores ------------
  • ID Frame of Orf of Gene Stop of Orf of
    Gene Raw Frame 1 2 3 -1 -2 -3 NC
  • ... ... ... ... ...
    ... ... ...
  • -2 10516 10459 10337 177
    120 1.14 0 - - 99 - 0 - 0
  • -3 10787 10781 10629 156
    150 -10.69 0 - - 99 - - 0 0
  • -3 11132 11120 11004 126
    114 -13.54 0 - - 99 - - 0 0
  • -2 12058 12055 11936 120
    117 -2.25 0 - - 99 - 0 - 0
  • -3 12437 12371 12249 186
    120 -13.68 0 - - 99 0 - 0 0
  • 0007 3 8109 8142 12632 4521
    4488 16.05 99 - - 99 - - - 0
  • 0008 -2 12763 12724 12545 216
    177 4.17 99 - - - - 99 - 0
  • 1 12850 12877 12996 144
    117 -16.13 0 0 - 99 - - - 0
  • -2 13165 13036 12905 258
    129 -1.31 0 - - 99 - 0 - 0
  • -3 13649 13634 13473 174
    159 -7.11 0 - - 99 - 0 0 0
  • -2 13675 13663 13256 417
    405 2.83 0 - - 99 - 0 - 0
  • -2 13972 13972 13850 120
    120 4.25 0 - - 99 - 0 - 0
  • 0009 3 12633 12642 14087 1452
    1443 15.20 99 - - 99 - - - 0
  • 1 14194 14323 14454 258
    129 1.62 0 0 - - - - 99 0
  • 0010 -3 14669 14663 14088 579
    573 13.36 99 - - - - - 99 0
  • 1 14482 14548 14673 189
    123 3.07 0 0 - - - - 99 0

42
Finding Initial Training Set
  • Glimmer1 2 have program long-orfs
  • It finds orfs longer than min-length that do not
    overlap other such orfs.
  • Current version automatically finds min-length.
  • Works OK for low- to medium-GC genomes
  • Terrible for high-GC genomes
  • Up to half the orfs can be wrong.
  • Uses first possible start site?even if orf is
    correct much of sequence isnt.
  • Can use codon composition information in Glimmer3
    version (similar to MED).

43
(No Transcript)
44
(No Transcript)
45
Running Glimmer3
1 Find long, non-overlapping orfs to use as a
training set long-orfs --no_header -t 1.0
tpall.1con tp.longorfs 2 Extract the training
sequences from the genome file extract --nostop
tpall.1con tp.longorfs gt tp.train 3 Build the
icm from the training sequences build-icm -r
tp.icm lt tp.train 4 Run first Glimmer glimmer3
-o50 -g110 -t30 tpall.1con tp.icm tp.run1 5 Get
training coordinates from first predictions tail
2 tp.run1.predict gt tp.coords 6 Create a
position weight matrix (PWM) from the regions
upstream of the start locations in
tp.coords upstream-coords.awk 25 0 tp.coords
extract tpall.1con - gt tp.upstream elph
tp.upstream LEN6 get-motif-counts.awk gt
tp.motif 7 Determine the distribution of
start-codon usage in tp.coords set startuse
start-codon-distrib --3comma tpall.1con
tp.coords 8 Run second Glimmer glimmer3 -o50
-g110 -t30 -b tp.motif -P startuse tpall.1con
tp.icm tp
46
A novel application of Glimmer
  • P. didemni is a photosynthetic microbe that lives
    as an endosymbiont in the sea squirt gt patella
  • P. didemni can only be cultured in L. patella
    cells

L. patella (sea squirt)
47
A novel application of Glimmer
  • Generated 82,337 shotgun reads
  • Bacterial genome 5 Mb
  • Host genome estimated at 160 Mb
  • Depth of coverage therefore much greater for
    bacterial contigs
  • Singleton reads primarily belong to host

48
A novel application of Glimmer
  • Create training sets by classifying reads from
    scaffolds gt 10kb as bacterial
  • 36,920 reads
  • Reads where both read and mate were singletons
    were treated as sea squirt
  • 21,276 reads
  • 21,141 reads unclassified

49
A novel application of Glimmer
  • Train a non-periodic IMM on both sets of data
  • 2 IMMs created
  • Then classify reads using the ratio of scores
    from the two IMMs
  • In a 5-fold cross-validation, classification
    accuracy was
  • 98.9 on P. didemni reads
  • 99.9 on L. patella reads

50
A novel application of Glimmer
  • Finally, re-assemble using ONLY bacterial reads
  • Original assembly
  • 65 scaffolds of 20 Kb or longer
  • total scaffold length 5.74 Mb
  • Improved assembly
  • 58 scaffolds of 20 Kb or longer
  • total scaffold length 5.84 Mb

51
Acknowledgements
  • Art Delcher
  • Steven Salzberg
  • Owen White (TIGR)Simon Kasif (Boston U.)Doug
    Harmon (Loyola College)
  • Kirsten BratkeEdwin Powers (Johns Hopkins U.)
  • Dan Haft (TIGR)
  • Bill Nelson (TIGR)
Write a Comment
User Comments (0)
About PowerShow.com