Title: Bacterial Gene Finding and Glimmer also Archaeal and viral gene finding
1Bacterial Gene Finding and Glimmer(also Archaeal
and viral gene finding)
- Arthur L. Delcher and Steven SalzbergCenter for
Bioinformatics and Computational Biology - University of Maryland
2Outline
- A (very) brief overview of microbial gene-finding
- Codon composition methods
- GeneMark Markov models
- Glimmer1 2
- Interpolated Markov Model (IMM)
- Interpolated Context Model (ICM)
- Glimmer3
- Reducing false positives
- Improving coding initiation site predictions
- Running Glimmer3
3Step One
- Find open reading frames (ORFs).
TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA
Stopcodon
Stopcodon
4Step One
- Find open reading frames (ORFs).
- But ORFs generally overlap
Reversestrand
Stopcodon
ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT
TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA
Stopcodon
Stopcodon
ShiftedStop
5Campylobacter jejuni RM1221 30.3GC
All ORFs on both strands shown - color indicates
reading frame Longest ORFs likely to be
protein-coding genes Note the low GC content
6Campylobacter jejuni RM1221 30.3GC
Purple lines are the predicted genes
Purple ORFs show annotated (true) genes
7Campylobacter jejuni RM1221 30.3GC
Mycobacterium smegmatis MC2 67.4GC
Note what happens in a high-GC genome
8Campylobacter jejuni RM1221 30.3GC
Mycobacterium smegmatis MC2 67.4GC
Purple lines show annotated genes
9The Problem
- Need to decide which orfs are genes.
- Then figure out the coding start sites
- Can do homology searches but that wont find
novel genes - Besides, there are errors in the databases
- Generally can assume that there are some known
genes to use as training set. - Or just find the obvious ones
10Codon Composition
- Find patterns of nucleotides in known coding
regions (assumed to be available). - Nucleotide distribution at 3 codon positions
- Hexamers
- GC-skew
- (G-C)/(GC) computed in windows of size N
- Amino-acid composition
- Use these to decide which orfs are genes.
- Prefer longer orfs
- Must deal with overlaps
11Bacterial Replication
Early replication
Theta structure
12Termination of Replication
B. subtilis
E. coli
13Borrelia burgdorferi (Lyme disease
pathogen)GC-skew plot
14Codon Composition
- Nucleotide variation at codon position
15Codon-Composition Gene Finders
- ZCURVE
- Guo, Ou Zhang, NAR 31, 2003
- Based on nucleotide and di-nucleotide frequency
in codons - Uses Z-transform and Fisher linear discriminant
- MED
- Ouyang, Zhu, Wang She, JBCB 2(2) 2004
- Based on amino-acid frequencies
- Uses nearest-neighbor classification on
entropies
16Probabilistic Methods
- Create models that have a probability of
generating any given sequence. - Train the models using examples of the types of
sequences to generate. - The score of an orf is the probability of the
model generating it. - Can also use a negative model (i.e., a model of
non-orfs) and make the score be the ratio of the
probabilities (i.e., the odds) of the two models. - Use logs to avoid underflow
17Fixed-Order Markov Models
- k th-order Markov model bases the probability of
an event on the preceding k events. - Example With a 3rd-order model the probability
of this sequencewould be
Target
Target
18Fixed-Order Markov Models
- Advantages
- Easy to train. Count frequencies of (k1)-mers
in training data. - Easy to compute probability of sequence.
- Disadvantages
- Many (k1)-mers may be undersampled in training
data. - Models data as fixed-length chunks.
Fixed-LengthContext
Target
19GeneMark
- Borodovsky McIninch, Comp. Chem 17, 1993.
- Uses 5th-order Markov model.
- Model is 3-periodic, i.e., a separate model for
each nucleotide position in the codon. - DNA region gets 7 scores 6 reading frames
non-coding?high score wins. - Lukashin Borodovsky, Nucl. Acids Res. 26, 1998
is the HMM version.
20Interpolated Markov Models (IMM)
- Introduced in Glimmer 1.0Salzberg, Delcher,
Kasif White, NAR 26, 1998. - Probability of the target position depends on a
variable number of previous positions (sometimes
2 bases, sometimes 3, 4, etc.) - How many is determined by the specific context.
- E.g., for context ggtta the next position might
depend on previous 3 bases tta . - But for context catta all 5 bases might be
used.
ggtta
21Real IMMs
- Model has additional probabilities, ?, that
determine which parts of the context to use. - E.g., the probability of g occurring after
context atca is
22Real IMMs
- Result is a linear combination of different
Markov orderswhere - Can view this as interpolating the results of
different-order models. - The probability of a sequence is still the
probability of the bases in the sequence.
23Real IMMs
- Problem How to determine the ?s (or
equivalently the bjs)? - Traditionally done with EM algorithm using
cross-validation (deleted estimation). - Slow
- Hard to understand results
- Overtraining can be a problem
- We will cover EM later as part of HMMs
24Glimmer IMM
- Glimmer assumes
- Longer context is always better
- Only reason not to use it is undersampling in
training data. - If sequence occurs frequently enough in training
data, use it, i.e., - Otherwise, use frequency and ?2 significance to
set ?. - Interpolation is always between only 2 adjacent
model lengths.
25More Precisely
- Suppose context of length k1 occurs a times,
alt400, with target distribution D1 and context
of length k occurs 400 times, with target
distribution D2 - Let p be the ?2 probability that D1 differs from
D2 - Then the interpolated distribution will be
This formula computes the probability of a base B
at any position in a sequence, given the context
preceding that position This formula is
repeatedly invoked for all positions in an ORF
26IMMs vs Fixed-Order Models
- Performance
- IMM generally should do at least as well as a
fixed-order model. - Some risk of overtraining.
- IMM result can be stored and used like a
fixed-order model. - IMM will be somewhat slower to train and will use
more memory.
Variable-LengthContext
Target
27Interpolated Context Model (ICM)
- Introduced in Glimmer 2.0 Delcher, Harmon, et
al., Nucl. Acids Res. 27, 1999. - Doesnt require adjacent bases in the window
preceding the target position. - Choose set of positions that are most informative
about the target position.
Variable-PositionContext
Target
28ICM
- For all windows compare distribution at each
context position with target position - Choose position with max mutual information
Compare
29ICM
- Continue for windows with fixed base at chosen
positions - Recurse until too few training windows
- Result is a treedepth is of context positions
used - Then same interpolation as IMM, between node and
parent in tree
Compare
30Sample ICM Model
ver 2.00 len 12 depth 7 periodicity 3
nodes 21845 0 ----------? 0.0260
0.225 0.240 0.358 0.177 1
---------a? 0.0769 0.243 0.165 0.501
0.092 2 ---------c? 0.0243 0.237
0.294 0.264 0.205 3 ---------g?
0.0597 0.219 0.240 0.407 0.134 4
---------t? 0.0277 0.209 0.245 0.291
0.256 5 --------aa? 0.0051 0.320
0.202 0.478 0.000 6 --------ac?
0.0046 0.195 0.118 0.541 0.146 7
--------ag? 0.0101 0.180 0.207 0.613
0.000 8 --------at? 0.0037 0.236
0.147 0.400 0.217 9 --------ca?
0.0019 0.199 0.246 0.303 0.253 10
--------cc? 0.0090 0.297 0.189 0.308
0.207 11 --------cg? 0.0035 0.180
0.406 0.267 0.148 12 --------ct?
0.0083 0.279 0.327 0.189 0.206 13
--------ga? 0.0057 0.277 0.326 0.397
0.000 14 --------gc? 0.0057 0.233
0.158 0.473 0.136 15 --------gg?
0.0056 0.119 0.252 0.417 0.213 16
--------gt? 0.0096 0.207 0.232 0.359
0.203 17 --------ta? 0.0034 0.205
0.170 0.400 0.224 18 --------tc?
0.0062 0.179 0.238 0.290 0.293 19
--------tg? 0.0052 0.183 0.338 0.323
0.156 20 --------tt? 0.0076 0.244
0.244 0.194 0.318 21 -------aaa?
0.0035 0.376 0.199 0.425 0.000 22
-------caa? 0.0037 0.313 0.197 0.490
0.000 23 -------gaa? 0.0052 0.283
0.199 0.517 0.000 24 -------taa?
0.0060 0.259 0.231 0.510 0.000 25
-------aac? 0.0031 0.176 0.138 0.519
0.166
31ICM
- ICMs not just for modeling genes
- Can use to model any set of training sequences
- Doesnt even need to be DNA
- Have used recently to finding contaminant
sequences in shotgun sequencing projects. - Used a non-periodic (i.e., homogeneous) model
32Fixed-Length Sequences and ICMs
- Array of ICMs?a different one for each position
in fixed-length sequence. - Can model fixed regions around transcription
start sites or splice sites. - Positions in region can be permuted.
33Overlapping Orfs
- Glimmer1 2 used rules.
- For overlapping orfs A and B, the overlap region
AB is scored separately - If AB scores higher in As reading frame, and A
is longer than B, then reject B. - If AB scores higher in Bs reading frame, and B
is longer than A, then reject A. - Otherwise, output both A and B with a
suspicious tag. - Also try to move start site to eliminate
overlaps. - Leads to high false-positive rate, especially in
high-GC genomes.
34Glimmer 2.0 Overlap Comments
- OrfID Start End Frame Len Score
Comments - 2 499 1689 1 L1191 r-1.151
ShadowedBy 3 - 3 1977 418 -1 L1560 r-1.317
Contains 2 LowScoreBy 4 L257 S0 - 4 1721 2611 2 L 891 r-1.156
OlapWith 3 L257 S99 - 5 2624 3775 2 L1152 r-1.184
- 7 3775 4356 1 L 582 r-1.223
DelayedBy 5 L216 - 8 4501 6615 1 L2115 r-1.181
OlapWith 9 L2103 S99 - 9 6633 4513 -1 L2121 r-1.365
LowScoreBy 8 L2103 S0 - 10 6648 9173 3 L2526 r-1.155
- 11 9229 10008 1 L 780 r-1.217
- 12 10411 11208 1 L 798 r-1.192
- 13 11215 12243 1 L1029 r-1.165
ShorterThan 15 L865 S99 - 15 12827 11379 -3 L1449 r-1.295
LowScoreBy 13 L865 S0 LowScoreBy 16 - 16 12243 13298 3 L1056 r-1.228
OlapWith 15 L585 S99 DelayedBy 13 L - 17 13298 14137 2 L 840 r-1.161
- 18 15143 14133 -3 L1011 r-1.230
- 19 15193 15519 1 L 327 r-1.284
- 20 15519 17249 3 L1731 r-1.175
- 21 17249 19015 2 L1767 r-1.266
DelayedBy 20 L1263
35Glimmer3
- Uses a dynamic programming algorithm to choose
the highest-scoring set of orfs and start sites. - Not quite an HMM
- Allows small overlaps between genes
- small is user-defined
- Scores of genes are not necessarily
probabilities. - Score includes component for likelihood of start
site
36Reverse Scoring
- In Glimmer3 orfs are scored from 3 end to 5
end, i.e., from stop codon back toward start
codon. - Helps find the start site.
- The start should appear near the peak of the
cumulative score in this direction. - Keeps the context part of the model entirely in
the coding portion of gene, which it was trained
on.
37Reverse Scoring
38Finding Start Sites
- Can specify a position weight matrix (PWM) for
the ribosome binding site. - Used to boost the scoreof potential start sites
that have a match. - Would like to try
- A fixed-length ICM model of start sites.
- A decision-tree model of start sites.
- Hard to get enough reliable data.
- Glimmer2 had a hybridization-energy calculation
with 16sRNA tail sequence that didnt work well.
39Glimmer3 vs. Glimmer2
Table 3. Probability models trained on the output
of the long-orfs program. Predictions compared to
genes with annotated function. Table 4.
Probability models trained on the output of the
long-orfs program. Predictions compared to all
annotated genes.
40Other Glimmer3 Features
- Can specify any set of start or stop codons
- Including by NCBI translation table
- Can specify list of guaranteed genes
- Or just orfs and let Glimmer choose the starts
- Can specify genome regions to avoid
- Guarantee no prediction will overlap these
regions - Useful for RNA genes (tRNAs, rRNAs)
- Output more meaningful orf scores.
- Allow genes to extend beyond end of sequence
- important for annotation of incomplete genomes
41Glimmer3 Output
- ----- Start ----- --- Length
---- -------------- Scores ------------ - ID Frame of Orf of Gene Stop of Orf of
Gene Raw Frame 1 2 3 -1 -2 -3 NC - ... ... ... ... ...
... ... ... - -2 10516 10459 10337 177
120 1.14 0 - - 99 - 0 - 0 - -3 10787 10781 10629 156
150 -10.69 0 - - 99 - - 0 0 - -3 11132 11120 11004 126
114 -13.54 0 - - 99 - - 0 0 - -2 12058 12055 11936 120
117 -2.25 0 - - 99 - 0 - 0 - -3 12437 12371 12249 186
120 -13.68 0 - - 99 0 - 0 0 - 0007 3 8109 8142 12632 4521
4488 16.05 99 - - 99 - - - 0 - 0008 -2 12763 12724 12545 216
177 4.17 99 - - - - 99 - 0 - 1 12850 12877 12996 144
117 -16.13 0 0 - 99 - - - 0 - -2 13165 13036 12905 258
129 -1.31 0 - - 99 - 0 - 0 - -3 13649 13634 13473 174
159 -7.11 0 - - 99 - 0 0 0 - -2 13675 13663 13256 417
405 2.83 0 - - 99 - 0 - 0 - -2 13972 13972 13850 120
120 4.25 0 - - 99 - 0 - 0 - 0009 3 12633 12642 14087 1452
1443 15.20 99 - - 99 - - - 0 - 1 14194 14323 14454 258
129 1.62 0 0 - - - - 99 0 - 0010 -3 14669 14663 14088 579
573 13.36 99 - - - - - 99 0 - 1 14482 14548 14673 189
123 3.07 0 0 - - - - 99 0
42Finding Initial Training Set
- Glimmer1 2 have program long-orfs
- It finds orfs longer than min-length that do not
overlap other such orfs. - Current version automatically finds min-length.
- Works OK for low- to medium-GC genomes
- Terrible for high-GC genomes
- Up to half the orfs can be wrong.
- Uses first possible start site?even if orf is
correct much of sequence isnt. - Can use codon composition information in Glimmer3
version (similar to MED).
43(No Transcript)
44(No Transcript)
45Running Glimmer3
1 Find long, non-overlapping orfs to use as a
training set long-orfs --no_header -t 1.0
tpall.1con tp.longorfs 2 Extract the training
sequences from the genome file extract --nostop
tpall.1con tp.longorfs gt tp.train 3 Build the
icm from the training sequences build-icm -r
tp.icm lt tp.train 4 Run first Glimmer glimmer3
-o50 -g110 -t30 tpall.1con tp.icm tp.run1 5 Get
training coordinates from first predictions tail
2 tp.run1.predict gt tp.coords 6 Create a
position weight matrix (PWM) from the regions
upstream of the start locations in
tp.coords upstream-coords.awk 25 0 tp.coords
extract tpall.1con - gt tp.upstream elph
tp.upstream LEN6 get-motif-counts.awk gt
tp.motif 7 Determine the distribution of
start-codon usage in tp.coords set startuse
start-codon-distrib --3comma tpall.1con
tp.coords 8 Run second Glimmer glimmer3 -o50
-g110 -t30 -b tp.motif -P startuse tpall.1con
tp.icm tp
46A novel application of Glimmer
- P. didemni is a photosynthetic microbe that lives
as an endosymbiont in the sea squirt gt patella - P. didemni can only be cultured in L. patella
cells
L. patella (sea squirt)
47A novel application of Glimmer
- Generated 82,337 shotgun reads
- Bacterial genome 5 Mb
- Host genome estimated at 160 Mb
- Depth of coverage therefore much greater for
bacterial contigs - Singleton reads primarily belong to host
48A novel application of Glimmer
- Create training sets by classifying reads from
scaffolds gt 10kb as bacterial - 36,920 reads
- Reads where both read and mate were singletons
were treated as sea squirt - 21,276 reads
- 21,141 reads unclassified
49A novel application of Glimmer
- Train a non-periodic IMM on both sets of data
- 2 IMMs created
- Then classify reads using the ratio of scores
from the two IMMs - In a 5-fold cross-validation, classification
accuracy was - 98.9 on P. didemni reads
- 99.9 on L. patella reads
50A novel application of Glimmer
- Finally, re-assemble using ONLY bacterial reads
- Original assembly
- 65 scaffolds of 20 Kb or longer
- total scaffold length 5.74 Mb
- Improved assembly
- 58 scaffolds of 20 Kb or longer
- total scaffold length 5.84 Mb
51Acknowledgements
- Art Delcher
- Steven Salzberg
- Owen White (TIGR)Simon Kasif (Boston U.)Doug
Harmon (Loyola College) - Kirsten BratkeEdwin Powers (Johns Hopkins U.)
- Dan Haft (TIGR)
- Bill Nelson (TIGR)