Bacterial Gene Finding and Glimmer also Archaeal and viral gene finding - PowerPoint PPT Presentation

1 / 51

About This Presentation

Title:

Bacterial Gene Finding and Glimmer also Archaeal and viral gene finding

Description:

DNA region gets 7 scores: 6 reading frames & non-coding?high score wins. ... Art Delcher. Steven Salzberg. Owen White (TIGR) Simon Kasif (Boston U. ... – PowerPoint PPT presentation

Number of Views:88

Avg rating:3.0/5.0

Slides: 52

Provided by: Adma3

Category:

more less

Transcript and Presenter's Notes

Title: Bacterial Gene Finding and Glimmer also Archaeal and viral gene finding

1
Bacterial Gene Finding and Glimmer(also Archaeal
and viral gene finding)

Arthur L. Delcher and Steven SalzbergCenter for
Bioinformatics and Computational Biology
University of Maryland

2
Outline

A (very) brief overview of microbial gene-finding
Codon composition methods
GeneMark Markov models
Glimmer1 2
Interpolated Markov Model (IMM)
Interpolated Context Model (ICM)
Glimmer3
Reducing false positives
Improving coding initiation site predictions
Running Glimmer3

3
Step One

Find open reading frames (ORFs).

TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA
Stopcodon
Stopcodon
4
Step One

Find open reading frames (ORFs).
But ORFs generally overlap

Reversestrand
Stopcodon
ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT
TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA
Stopcodon
Stopcodon
ShiftedStop
5
Campylobacter jejuni RM1221 30.3GC
All ORFs on both strands shown - color indicates
reading frame Longest ORFs likely to be
protein-coding genes Note the low GC content
6
Campylobacter jejuni RM1221 30.3GC
Purple lines are the predicted genes
Purple ORFs show annotated (true) genes
7
Campylobacter jejuni RM1221 30.3GC
Mycobacterium smegmatis MC2 67.4GC
Note what happens in a high-GC genome
8
Campylobacter jejuni RM1221 30.3GC
Mycobacterium smegmatis MC2 67.4GC
Purple lines show annotated genes
9
The Problem

Need to decide which orfs are genes.
Then figure out the coding start sites
Can do homology searches but that wont find
novel genes
Besides, there are errors in the databases
Generally can assume that there are some known
genes to use as training set.
Or just find the obvious ones

10
Codon Composition

Find patterns of nucleotides in known coding
regions (assumed to be available).
Nucleotide distribution at 3 codon positions
Hexamers
GC-skew
(G-C)/(GC) computed in windows of size N
Amino-acid composition
Use these to decide which orfs are genes.
Prefer longer orfs
Must deal with overlaps

11
Bacterial Replication
Early replication
Theta structure
12
Termination of Replication
B. subtilis
E. coli
13
Borrelia burgdorferi (Lyme disease
pathogen)GC-skew plot
14
Codon Composition

Nucleotide variation at codon position

15
Codon-Composition Gene Finders

ZCURVE
Guo, Ou Zhang, NAR 31, 2003
Based on nucleotide and di-nucleotide frequency
in codons
Uses Z-transform and Fisher linear discriminant
MED
Ouyang, Zhu, Wang She, JBCB 2(2) 2004
Based on amino-acid frequencies
Uses nearest-neighbor classification on
entropies

16
Probabilistic Methods

Create models that have a probability of
generating any given sequence.
Train the models using examples of the types of
sequences to generate.
The score of an orf is the probability of the
model generating it.
Can also use a negative model (i.e., a model of
non-orfs) and make the score be the ratio of the
probabilities (i.e., the odds) of the two models.
Use logs to avoid underflow

17
Fixed-Order Markov Models

k th-order Markov model bases the probability of
an event on the preceding k events.
Example With a 3rd-order model the probability
of this sequencewould be

Target
Target
18
Fixed-Order Markov Models

Advantages
Easy to train. Count frequencies of (k1)-mers
in training data.
Easy to compute probability of sequence.
Disadvantages
Many (k1)-mers may be undersampled in training
data.
Models data as fixed-length chunks.

Fixed-LengthContext
Target
19
GeneMark

Borodovsky McIninch, Comp. Chem 17, 1993.
Uses 5th-order Markov model.
Model is 3-periodic, i.e., a separate model for
each nucleotide position in the codon.
DNA region gets 7 scores 6 reading frames
non-coding?high score wins.
Lukashin Borodovsky, Nucl. Acids Res. 26, 1998
is the HMM version.

20
Interpolated Markov Models (IMM)

Introduced in Glimmer 1.0Salzberg, Delcher,
Kasif White, NAR 26, 1998.
Probability of the target position depends on a
variable number of previous positions (sometimes
2 bases, sometimes 3, 4, etc.)
How many is determined by the specific context.
E.g., for context ggtta the next position might
depend on previous 3 bases tta .
But for context catta all 5 bases might be
used.

ggtta
21
Real IMMs

Model has additional probabilities, ?, that
determine which parts of the context to use.
E.g., the probability of g occurring after
context atca is

22
Real IMMs

Result is a linear combination of different
Markov orderswhere
Can view this as interpolating the results of
different-order models.
The probability of a sequence is still the
probability of the bases in the sequence.

23
Real IMMs

Problem How to determine the ?s (or
equivalently the bjs)?
Traditionally done with EM algorithm using
cross-validation (deleted estimation).
Slow
Hard to understand results
Overtraining can be a problem
We will cover EM later as part of HMMs

24
Glimmer IMM

Glimmer assumes
Longer context is always better
Only reason not to use it is undersampling in
training data.
If sequence occurs frequently enough in training
data, use it, i.e.,
Otherwise, use frequency and ?2 significance to
set ?.
Interpolation is always between only 2 adjacent
model lengths.

25
More Precisely

Suppose context of length k1 occurs a times,
alt400, with target distribution D1 and context
of length k occurs 400 times, with target
distribution D2
Let p be the ?2 probability that D1 differs from
D2
Then the interpolated distribution will be

This formula computes the probability of a base B
at any position in a sequence, given the context
preceding that position This formula is
repeatedly invoked for all positions in an ORF
26
IMMs vs Fixed-Order Models

Performance
IMM generally should do at least as well as a
fixed-order model.
Some risk of overtraining.
IMM result can be stored and used like a
fixed-order model.
IMM will be somewhat slower to train and will use
more memory.

Variable-LengthContext
Target
27
Interpolated Context Model (ICM)

Introduced in Glimmer 2.0 Delcher, Harmon, et
al., Nucl. Acids Res. 27, 1999.
Doesnt require adjacent bases in the window
preceding the target position.
Choose set of positions that are most informative
about the target position.

Variable-PositionContext
Target
28
ICM

For all windows compare distribution at each
context position with target position
Choose position with max mutual information

Compare
29
ICM

Continue for windows with fixed base at chosen
positions
Recurse until too few training windows
Result is a treedepth is of context positions
used
Then same interpolation as IMM, between node and
parent in tree

Compare
30
Sample ICM Model
ver 2.00 len 12 depth 7 periodicity 3
nodes 21845 0 ----------? 0.0260
0.225 0.240 0.358 0.177 1
---------a? 0.0769 0.243 0.165 0.501
0.092 2 ---------c? 0.0243 0.237
0.294 0.264 0.205 3 ---------g?
0.0597 0.219 0.240 0.407 0.134 4
---------t? 0.0277 0.209 0.245 0.291
0.256 5 --------aa? 0.0051 0.320
0.202 0.478 0.000 6 --------ac?
0.0046 0.195 0.118 0.541 0.146 7
--------ag? 0.0101 0.180 0.207 0.613
0.000 8 --------at? 0.0037 0.236
0.147 0.400 0.217 9 --------ca?
0.0019 0.199 0.246 0.303 0.253 10
--------cc? 0.0090 0.297 0.189 0.308
0.207 11 --------cg? 0.0035 0.180
0.406 0.267 0.148 12 --------ct?
0.0083 0.279 0.327 0.189 0.206 13
--------ga? 0.0057 0.277 0.326 0.397
0.000 14 --------gc? 0.0057 0.233
0.158 0.473 0.136 15 --------gg?
0.0056 0.119 0.252 0.417 0.213 16
--------gt? 0.0096 0.207 0.232 0.359
0.203 17 --------ta? 0.0034 0.205
0.170 0.400 0.224 18 --------tc?
0.0062 0.179 0.238 0.290 0.293 19
--------tg? 0.0052 0.183 0.338 0.323
0.156 20 --------tt? 0.0076 0.244
0.244 0.194 0.318 21 -------aaa?
0.0035 0.376 0.199 0.425 0.000 22
-------caa? 0.0037 0.313 0.197 0.490
0.000 23 -------gaa? 0.0052 0.283
0.199 0.517 0.000 24 -------taa?
0.0060 0.259 0.231 0.510 0.000 25
-------aac? 0.0031 0.176 0.138 0.519
0.166
31
ICM

ICMs not just for modeling genes
Can use to model any set of training sequences
Doesnt even need to be DNA
Have used recently to finding contaminant
sequences in shotgun sequencing projects.
Used a non-periodic (i.e., homogeneous) model

32
Fixed-Length Sequences and ICMs

Array of ICMs?a different one for each position
in fixed-length sequence.
Can model fixed regions around transcription
start sites or splice sites.
Positions in region can be permuted.

33
Overlapping Orfs

Glimmer1 2 used rules.
For overlapping orfs A and B, the overlap region
AB is scored separately
If AB scores higher in As reading frame, and A
is longer than B, then reject B.
If AB scores higher in Bs reading frame, and B
is longer than A, then reject A.
Otherwise, output both A and B with a
suspicious tag.
Also try to move start site to eliminate
overlaps.
Leads to high false-positive rate, especially in
high-GC genomes.

34
Glimmer 2.0 Overlap Comments

OrfID Start End Frame Len Score
Comments
2 499 1689 1 L1191 r-1.151
ShadowedBy 3
3 1977 418 -1 L1560 r-1.317
Contains 2 LowScoreBy 4 L257 S0
4 1721 2611 2 L 891 r-1.156
OlapWith 3 L257 S99
5 2624 3775 2 L1152 r-1.184
7 3775 4356 1 L 582 r-1.223
DelayedBy 5 L216
8 4501 6615 1 L2115 r-1.181
OlapWith 9 L2103 S99
9 6633 4513 -1 L2121 r-1.365
LowScoreBy 8 L2103 S0
10 6648 9173 3 L2526 r-1.155
11 9229 10008 1 L 780 r-1.217
12 10411 11208 1 L 798 r-1.192
13 11215 12243 1 L1029 r-1.165
ShorterThan 15 L865 S99
15 12827 11379 -3 L1449 r-1.295
LowScoreBy 13 L865 S0 LowScoreBy 16
16 12243 13298 3 L1056 r-1.228
OlapWith 15 L585 S99 DelayedBy 13 L
17 13298 14137 2 L 840 r-1.161
18 15143 14133 -3 L1011 r-1.230
19 15193 15519 1 L 327 r-1.284
20 15519 17249 3 L1731 r-1.175
21 17249 19015 2 L1767 r-1.266
DelayedBy 20 L1263

35
Glimmer3

Uses a dynamic programming algorithm to choose
the highest-scoring set of orfs and start sites.
Not quite an HMM
Allows small overlaps between genes
small is user-defined
Scores of genes are not necessarily
probabilities.
Score includes component for likelihood of start
site

36
Reverse Scoring

In Glimmer3 orfs are scored from 3 end to 5
end, i.e., from stop codon back toward start
codon.
Helps find the start site.
The start should appear near the peak of the
cumulative score in this direction.
Keeps the context part of the model entirely in
the coding portion of gene, which it was trained
on.

37
Reverse Scoring
38
Finding Start Sites

Can specify a position weight matrix (PWM) for
the ribosome binding site.
Used to boost the scoreof potential start sites
that have a match.
Would like to try
A fixed-length ICM model of start sites.
A decision-tree model of start sites.
Hard to get enough reliable data.
Glimmer2 had a hybridization-energy calculation
with 16sRNA tail sequence that didnt work well.

39
Glimmer3 vs. Glimmer2
Table 3. Probability models trained on the output
of the long-orfs program. Predictions compared to
genes with annotated function. Table 4.
Probability models trained on the output of the
long-orfs program. Predictions compared to all
annotated genes.
40
Other Glimmer3 Features

Can specify any set of start or stop codons
Including by NCBI translation table
Can specify list of guaranteed genes
Or just orfs and let Glimmer choose the starts
Can specify genome regions to avoid
Guarantee no prediction will overlap these
regions
Useful for RNA genes (tRNAs, rRNAs)
Output more meaningful orf scores.
Allow genes to extend beyond end of sequence
important for annotation of incomplete genomes

41
Glimmer3 Output

----- Start ----- --- Length
---- -------------- Scores ------------
ID Frame of Orf of Gene Stop of Orf of
Gene Raw Frame 1 2 3 -1 -2 -3 NC
... ... ... ... ...
... ... ...
-2 10516 10459 10337 177
120 1.14 0 - - 99 - 0 - 0
-3 10787 10781 10629 156
150 -10.69 0 - - 99 - - 0 0
-3 11132 11120 11004 126
114 -13.54 0 - - 99 - - 0 0
-2 12058 12055 11936 120
117 -2.25 0 - - 99 - 0 - 0
-3 12437 12371 12249 186
120 -13.68 0 - - 99 0 - 0 0
0007 3 8109 8142 12632 4521
4488 16.05 99 - - 99 - - - 0
0008 -2 12763 12724 12545 216
177 4.17 99 - - - - 99 - 0
1 12850 12877 12996 144
117 -16.13 0 0 - 99 - - - 0
-2 13165 13036 12905 258
129 -1.31 0 - - 99 - 0 - 0
-3 13649 13634 13473 174
159 -7.11 0 - - 99 - 0 0 0
-2 13675 13663 13256 417
405 2.83 0 - - 99 - 0 - 0
-2 13972 13972 13850 120
120 4.25 0 - - 99 - 0 - 0
0009 3 12633 12642 14087 1452
1443 15.20 99 - - 99 - - - 0
1 14194 14323 14454 258
129 1.62 0 0 - - - - 99 0
0010 -3 14669 14663 14088 579
573 13.36 99 - - - - - 99 0
1 14482 14548 14673 189
123 3.07 0 0 - - - - 99 0

42
Finding Initial Training Set

Glimmer1 2 have program long-orfs
It finds orfs longer than min-length that do not
overlap other such orfs.
Current version automatically finds min-length.
Works OK for low- to medium-GC genomes
Terrible for high-GC genomes
Up to half the orfs can be wrong.
Uses first possible start site?even if orf is
correct much of sequence isnt.
Can use codon composition information in Glimmer3
version (similar to MED).

43
(No Transcript)
44
(No Transcript)
45
Running Glimmer3
1 Find long, non-overlapping orfs to use as a
training set long-orfs --no_header -t 1.0
tpall.1con tp.longorfs 2 Extract the training
sequences from the genome file extract --nostop
tpall.1con tp.longorfs gt tp.train 3 Build the
icm from the training sequences build-icm -r
tp.icm lt tp.train 4 Run first Glimmer glimmer3
-o50 -g110 -t30 tpall.1con tp.icm tp.run1 5 Get
training coordinates from first predictions tail
2 tp.run1.predict gt tp.coords 6 Create a
position weight matrix (PWM) from the regions
upstream of the start locations in
tp.coords upstream-coords.awk 25 0 tp.coords
extract tpall.1con - gt tp.upstream elph
tp.upstream LEN6 get-motif-counts.awk gt
tp.motif 7 Determine the distribution of
start-codon usage in tp.coords set startuse
start-codon-distrib --3comma tpall.1con
tp.coords 8 Run second Glimmer glimmer3 -o50
-g110 -t30 -b tp.motif -P startuse tpall.1con
tp.icm tp
46
A novel application of Glimmer