Gene finding pipelines for automatic annotation of new

1
Gene finding pipelines for automatic annotation
of new eukaryotic and bacterial genomes

Victor Solovyev
Professor of computer science, Royal Holloway,
University of London
Chairman, Softberry Inc.

2
New genomes sequencing
Human, Mouse, Rat, Cow, Sheep, Cat, Dog, Pig,
Chicken, Drosophila, Bee, Zebrafish, Fugu,
Nematodes
Arabidopsis, Rice, Medicago, Soybean, Barley,
Poplar, Tomato, Oat, Wheat, Corn
S. cerevisiae, S. pombe, Aspergillus nidulans,
Coprinus cinereus, Cryptococcus neoformans,
Fusarium graminearum, Magnaporthe grisea,
Neurospora crassa, Ustilago maydis
Anopheles, P. falciparum, E. cuniculi, Chlamy,
Ciona, Diatom, White rot, P. sojae
Bacteria, bacterial communities
3
Expression stages and structural organization of
typical eukaryotic protein-coding gene
4
Ab initio multiple gene prediction approaches
Probabilistic
Genscan (Burge, Karlin, 1997), HMMgene (Krogh, 1997), Fgenesh (Salamov, Solovyev, 1998), Genie (Reese et al., 2000)
Likelihoods of gene components, HMM
Balanced score as a product of likelihoods, simple features

Pattern recognition
GeneID (Guigo et al., 1992), neural networks
Fgenes (Solovyev, 1997), discriminant functions
Flexible combinations of any discriminative features
5
Hidden Markov model of multiple eukaryotic
genes. Used in the Genscan and Fgenesh programs.
Ei and Ii are different exon and intron states,
respectively (i = 0, 1, 2 reflect the 3 possible
reading frames). E5/E3 mark non-coding exons
and I5/I3 are the 5'- and 3'-introns adjacent to
non-coding exons.
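The frame index on the exon and intron states can be illustrated with a small sketch. It assumes one common convention, which the slide does not spell out: the index of an intron state is the number of codon bases already read when the intron starts. The function names are hypothetical and are not part of Genscan or Fgenesh.

```python
def next_phase(phase: int, exon_length: int) -> int:
    """Return the frame index after reading an exon of the given length.

    Assumption: phase = number of bases of an unfinished codon (0, 1 or 2).
    """
    return (phase + exon_length) % 3


def intron_state_path(exon_lengths):
    """List the intron states (I0, I1, I2) visited between coding exons.

    The first exon is assumed to start at the first base of a codon.
    """
    phase = 0
    states = []
    for length in exon_lengths[:-1]:  # no intron follows the last exon
        phase = next_phase(phase, length)
        states.append(f"I{phase}")
    return states


if __name__ == "__main__":
    # Three coding exons of 70, 107 and 42 bp.
    print(intron_state_path([70, 107, 42]))
```

Tracking the index this way is what lets a model of this kind keep the reading frame consistent across an intron of any length.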
7
Signal differences: start of translation
8
Signal differences: donor splice site
9
Importance of good organism-specific parameters. Rice
example: Fgenesh with Monocot gene-finding
parameters.
10
Strategy to make gene-finding parameters for new
genomes

Using GenBank genes for close organisms
Using new genomic sequence
a) Having known mRNA/cDNA sequences:
map mRNA by the Est_map program on the genomic sequence,
extract genes and use them as a learning set
b) Using ab initio gene prediction:
predict genes, select genes with protein support
c) Using a database of known proteins (NR)
that can be mapped on the genome by the Prot_map program
with reconstruction of gene structure
In addition, find protein-coding ORFs with the BESTORF
program in a set of ESTs and use them in
learning of coding parameters

11
Learning parameters using GenBank genes for
close organisms
Select a GenBank organism class having enough known genes.
Create an Infogene database with reconstructed genes by running the Infog program (some genes might be described in several GenBank entries).
Run the GetGenes program to extract genes from Infogene for use in learning programs (cleaning out genes with annotation errors in ORFs and splice sites).
Run the Efeature program to create a set of coding regions (usually significantly bigger than the set of genes) and a set of non-coding regions.
Run scripts/programs for learning coding parameters (there might be several GC zones).
Run scripts/programs for learning splice site parameters.
Run scripts/programs to create exon length distributions and other statistics.
Check the parameters of initial probabilities (exons/introns/noncoding), which depend on gene density in the genome and on gene structure.
Test and edit parameters to select the best variant.
Repeat learning on bigger or smaller organism classes and select the best learning set.
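As a rough illustration of what "learning coding parameters" from sets of coding and non-coding regions can mean, the sketch below estimates in-frame codon frequencies from two training sets and scores a sequence by log-odds. It is a simplified stand-in and not the actual Softberry learning scripts; the function names, the pseudocount and the toy training sequences are all assumptions.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate in-frame codon frequencies from a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with N etc.
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def log_odds(seq, coding, noncoding):
    """Sum of log(coding / non-coding) frequencies over in-frame codons."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding:
            score += math.log(coding[codon] / noncoding[codon])
    return score


if __name__ == "__main__":
    # Toy training sets, invented for the example.
    coding = codon_frequencies(["ATGGCCGCC" * 3])
    noncoding = codon_frequencies(["TTTTTTAAA" * 3])
    print(log_odds("GCCGCC", coding, noncoding))
    print(log_odds("TTTAAA", coding, noncoding))
```

A separate table could be trained per GC zone in the same way, which is one reading of the "several GC zones" remark above.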
12
Developed parameters for the Fgenesh group of
programs

Human, Mouse, Drosophila, C. elegans, Fish
(WUSTL, Baylor, CSHL, JGI)
Dicots (Arabidopsis), Nicotiana tabacum,
Monocots (Corn, Rice, Wheat, Barley) (TIGR,
Rutgers University)
Algae, Plasmodium falciparum, Anopheles gambiae
Schizosaccharomyces pombe, Neurospora crassa,
Aspergillus nidulans, Coprinus cinereus,
Cryptococcus neoformans, Fusarium graminearum,
Magnaporthe grisea, Ustilago maydis (MIT/Broad
Institute)
Medicago (University of Minnesota)
Brugia malayi (TIGR)

13
FGENESH AUTOMATIC EUKARYOTIC GENOME ANNOTATION
PIPELINE

RefSeq mRNA mapping by the Est_map program: mapped genes are excluded from the further gene prediction process.
Map all known proteins (NR) on the genome by the Prot_map program with gene structure reconstruction (find regions occupied by genes).
Run Fgenesh+ using mapped proteins and selected genome sequences.
Run ab initio Fgenesh gene prediction on the rest of the genome.
Search for protein homologs (by BLAST) of all products of predicted genes in NR.
Run Fgenesh+ gene prediction on sequences (from stage 4) having protein homologs.
Second run of Fgenesh in regions free from genes selected on stages 1, 3, 5.
Run Fgenesh gene prediction in large introns of known and predicted genes.
A special variant of the pipeline can take synteny into account (human-mouse, for example) using the FGENESH-2 program, which predicts genes using 2 similar genomic sequences from different species.
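The "second run in regions free from genes" stage needs the complement of the intervals already occupied by accepted genes. A minimal sketch of that bookkeeping follows; the function name, the 1-based inclusive coordinates and the `min_length` filter are assumptions made for the example, not details of the pipeline.

```python
def free_regions(seq_length, gene_intervals, min_length=1):
    """Return the 1-based inclusive intervals not covered by any gene.

    gene_intervals: iterable of (start, end) pairs; they may overlap.
    min_length: hypothetical filter that drops regions too short to rerun.
    """
    free = []
    cursor = 1  # first position not yet known to be covered
    for start, end in sorted(gene_intervals):
        if start > cursor:
            free.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= seq_length:
        free.append((cursor, seq_length))
    return [(s, e) for s, e in free if e - s + 1 >= min_length]


if __name__ == "__main__":
    genes = [(10, 20), (15, 30), (50, 60)]
    print(free_regions(100, genes))
    print(free_regions(100, genes, min_length=20))
```

The same helper could be applied to the intron coordinates of accepted genes to select the large introns searched in the last ab initio stage.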

14
Components of Fgenesh automatic pipeline
The gene-finding group of programs shares most components and works with the same organism-specific parameters.
Fgenesh: ab initio gene prediction. Runs on whole chromosomes (300 MB). Fast: the human genome (3 GB of sequence) is processed in 4 hours.
Fgenesh+: this derivative of Fgenesh uses information on homologous proteins to improve the accuracy of gene prediction, if such homologs can be found.
Fgenesh-2: a variant of Fgenesh that uses homology between two genomic DNA sequences, such as human and mouse, as an extra factor for more accurate gene prediction.
Fgenesh_C: uses information on homologous mRNA/EST to improve the accuracy of gene prediction. Can be used to reconstruct alternatively spliced genes.
15
Components of Fgenesh automatic pipeline
Programs for mapping known mRNA/EST or proteins with reconstruction of gene structure
Est_map: a program for fast mapping of a set of mRNAs/ESTs to a chromosome sequence. It takes splice site weight matrices into account for accurate mapping (important for accurately mapping very small exons).
Prot_map: used for fast mapping of a database of protein sequences to a genome, accounting for splice sites (useful for genomes with few known genes and for finding pseudogenes).
16
Example of Prot_map mapping of a protein sequence
to genome
First sequence Chr19 cut1 3000000 DD
Sequence 1( 1), S 52.623, L 1739 IPIIPI00170643.1SWISS-PROTQ8TEK3-1
Summ of block lengths 1468, Alignment bounds
On first sequence start 2146727, end 2167939, length 21213
On second sequence start 263, end 1739, length 1477
Blocks of alignment 19
1 E 2146727 70 ca GT P 2146727 263 L 23, G 101.313, W 1160, S 14.1355
2 E 2147573 107 AG GT P 2147575 287 L 35, G 102.892, W 1810, S 17.7256
3 E 2148934 42 AG GT P 2148934 322 L 14, G 102.539, W 720, S 11.1699
4 E 2150399 111 AG GT P 2150399 336 L 37, G 101.777, W 1880, S 18.0157
5 E 2150620 235 AG GT P 2150620 373 L 78, G 101.251, W 3930, S 26.0143
6 E 2151098 114 AG GT P 2151100 452 L 37, G 105.778, W 2000, S 18.7669
7 E 2151750 92 AG GT P 2151752 490 L 30, G 101.188, W 1510, S 16.1227
8 E 2153538 102 AG GT P 2153538 520 L 34, G 100.414, W 1690, S 17.0246
9 E 2153848 138 AG GT P 2153848 554 L 46, G 99.168, W 2240, S 19.5414
10 E 2154470 126 AG GT P 2154470 600 L 42, G 101.071, W 2110, S 19.0531
11 E 2156280 485 AG GT P 2156280 642 L 161, G 102.616, W 8290, S 37.9091
12 E 2156954 136 AG GT P 2156955 804 L 45, G 103.244, W 2340, S 20.1719
13 E 2157771 147 AG GT P 2157771 849 L 49, G 98.511, W 2360, S 20.0267
14 E 2160107 115 AG GT P 2160107 898 L 38, G 100.777, W 1900, S 18.0672
15 E 2161975 584 AG GT P 2161977 937 L 194, G 101.031, W 9740, S 40.932
16 E 2163280 206 AG GC P 2163280 1131 L 68, G 103.135, W 3530, S 24.7691
17 E 2165387 65 AG GT P 2165388 1200 L 21, G 98.427, W 1010, S 13.0987
18 E 2166182 945 AG GT P 2166184 1222 L 314, G 102.034, W 16020, S 52.6232
19 E 2167736 608 AG ta P 2167738 1538 L 202, G 104.624, W 10730, S 43.3437
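Block listings like the one above are easy to check mechanically. The sketch below parses the 19 block lines of this example and sums the `L` field of each block, which for this alignment adds up to the reported sum of block lengths, 1468. The regular expression and the field names (`genome_start`, `aligned_length` and so on) are guesses based only on this one example and are not a documented Prot_map format.

```python
import re

# Block lines of the example above, one block per line.
BLOCKS = """\
1 E 2146727 70 ca GT P 2146727 263 L 23, G 101.313, W 1160, S 14.1355
2 E 2147573 107 AG GT P 2147575 287 L 35, G 102.892, W 1810, S 17.7256
3 E 2148934 42 AG GT P 2148934 322 L 14, G 102.539, W 720, S 11.1699
4 E 2150399 111 AG GT P 2150399 336 L 37, G 101.777, W 1880, S 18.0157
5 E 2150620 235 AG GT P 2150620 373 L 78, G 101.251, W 3930, S 26.0143
6 E 2151098 114 AG GT P 2151100 452 L 37, G 105.778, W 2000, S 18.7669
7 E 2151750 92 AG GT P 2151752 490 L 30, G 101.188, W 1510, S 16.1227
8 E 2153538 102 AG GT P 2153538 520 L 34, G 100.414, W 1690, S 17.0246
9 E 2153848 138 AG GT P 2153848 554 L 46, G 99.168, W 2240, S 19.5414
10 E 2154470 126 AG GT P 2154470 600 L 42, G 101.071, W 2110, S 19.0531
11 E 2156280 485 AG GT P 2156280 642 L 161, G 102.616, W 8290, S 37.9091
12 E 2156954 136 AG GT P 2156955 804 L 45, G 103.244, W 2340, S 20.1719
13 E 2157771 147 AG GT P 2157771 849 L 49, G 98.511, W 2360, S 20.0267
14 E 2160107 115 AG GT P 2160107 898 L 38, G 100.777, W 1900, S 18.0672
15 E 2161975 584 AG GT P 2161977 937 L 194, G 101.031, W 9740, S 40.932
16 E 2163280 206 AG GC P 2163280 1131 L 68, G 103.135, W 3530, S 24.7691
17 E 2165387 65 AG GT P 2165388 1200 L 21, G 98.427, W 1010, S 13.0987
18 E 2166182 945 AG GT P 2166184 1222 L 314, G 102.034, W 16020, S 52.6232
19 E 2167736 608 AG ta P 2167738 1538 L 202, G 104.624, W 10730, S 43.3437
"""

# Assumed layout: index, E, genome start, exon length, two flanking
# dinucleotides, P, two positions, then L, G, W and S values.
BLOCK_RE = re.compile(
    r"^(\d+) E (\d+) (\d+) (\S+) (\S+) P (\d+) (\d+) "
    r"L (\d+), G ([\d.]+), W (\d+), S ([\d.]+)$"
)


def parse_blocks(text):
    """Parse block lines into dictionaries; raise on unrecognized lines."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = BLOCK_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognized block line: {line!r}")
        blocks.append({
            "index": int(m.group(1)),
            "genome_start": int(m.group(2)),
            "exon_length": int(m.group(3)),
            "left_site": m.group(4),
            "right_site": m.group(5),
            "protein_start": int(m.group(7)),
            "aligned_length": int(m.group(8)),
            "score": float(m.group(11)),
        })
    return blocks


if __name__ == "__main__":
    blocks = parse_blocks(BLOCKS)
    print(len(blocks), sum(b["aligned_length"] for b in blocks))
```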
17
Prot_map example of alignment
Each group below shows genomic coordinates, the genomic sequence (introns in lower case, translated exons in upper case), the protein sequence, and protein coordinates.

1 11 2146713 2146723 2146739 2146769
gatcacagaggctgg(..)agtgtctgtgtttca?GGRIVSSKPFAPLNFRINSRNLSg
...............(..)evdhqlkerfanmke GGRIVSSKPFAPLNFRINSRNLS-
248 248 249 259 267 277

2146797 2146806 2147558 2147568 2147581 2147611
gtaagaaactctcat(..)ctgtggctcctgcagacIGTIMRVVELSPLKGSVSWTGK
---------------(..)----------------dIGTIMRVVELSPLKGSVSWTGK
286 286 286 286 289 299

2147641 2147671 2147686 2148919 2148926 2148937
PVSYYLHTIDRTIgtgagtatctcgctg(..)ctttcttctttttagLENYFSSLKNP
PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
309 319 322 322 322 323

2148967 2148982 2150384 2150391 2150402 2150432
KLRgtaagtttgtgtgtt(..)ctgctctccttccagEEQEAARRRQQRESKSNAATP
KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
333 336 336 336 337 347

2150462 2150492 2150513 2150523 2150609 2150619
TKGPEGKVAGPADAPMgtaaggccccagcct(..)ccttgtgtcctccagDSGAEEEK
TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
357 367 373 373 373 373

2150644 2150674 2150704 2150734 2150764 2150794
AGAATVKKPSPSKARKKKLNKKGRKMAGRKRGRPKKMNTANPERKPKKNQTALDALHAQT
AGAATVKKPSPSKARKKKLNKKGRKMAGRKRGRPKKMNTANPERKPKKNQTALDALHAQT
18
Analysis of gene-finding accuracy and running time
Test on 83 small (< 20 000 bp) human genes using mouse homolog
Program   Sne   Sn_pe  Spe   Sn_n  Sp_n  C       Time
Prot_map  73.7  93.3   71.3  93.9  88.6  0.9015  1 min
Genewise  76.4  93.9   76.4  94.9  89.4  0.9116  90 min
Fgenesh
Test on 8 big (> 400 000 bp) human genes using mouse homolog
Program   Sne   Sn_pe  Spe   Sn_n  Sp_n  C       Time
Prot_map  87.9  96.0   81.3  94.3  96.0  0.9514  1 min
Genewise  91.9  97.0   90.1  95.1  97.0  0.9599  1200 min
Fgenesh
Prot_map mapping of the human protein set of 55946 proteins on chromosome 19 (59 MB) takes 90 min (best hit for each protein) and 148 min (all significant hits for each protein).
Can be used for fast finding of an initial gene set in a new genome by mapping all known proteins.
Used for finding pseudogenes, since it maps with frameshifts that damage ORFs.
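The slide does not define its accuracy measures. The sketch below assumes that Sn_n, Sp_n and C follow the convention commonly used in gene-finding evaluations: nucleotide-level sensitivity TP/(TP+FN), specificity TP/(TP+FP), and the correlation coefficient over coding/non-coding calls. The function name and the set-based input are choices made for this example.

```python
import math


def nucleotide_accuracy(true_coding, predicted_coding, seq_length):
    """Return (Sn, Sp, CC) at the nucleotide level.

    true_coding, predicted_coding: sets of coding positions.
    Sp is TP / (TP + FP), as is usual in gene-finding papers (assumption).
    """
    true_coding = set(true_coding)
    predicted_coding = set(predicted_coding)
    tp = len(true_coding & predicted_coding)
    fp = len(predicted_coding - true_coding)
    fn = len(true_coding - predicted_coding)
    tn = seq_length - tp - fp - fn

    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


if __name__ == "__main__":
    # Toy case: a 10 bp true exon and a 10 bp prediction shifted by 5 bp.
    print(nucleotide_accuracy(range(10, 20), range(15, 25), 100))
```

Exon-level measures such as Sne and Spe would compare exact exon boundaries instead of single positions.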
19
New Fgenesh+ and GeneWise
1) 700 genes with 6508 exons having a similar protein with > 90% similarity
Program   Sne   Sn_pe  Spe   Sn_n  Sp_n  C
GeneWise  94.1  97.8   96.0  98.9  99.6  0.992
Fgenesh+  96.9  98.5   97.9  99.0  99.5  0.992
2) 18 genes with 116 exons having a similar Drosophila protein with identity 28-70%
Program   Sne   Sn_pe  Spe   Sn_n  Sp_n  C
GeneWise  40.5  64.7   62.7  68.3  99.7  0.813
Fgenesh+  70.7  84.5   82.0  84.8  96.9  0.898
Exon type  Observed  Predicted  Correct
5'-exon    18        14         2 (11 by Fgenesh+)
Internal   80        43         38 (59 by Fgenesh+)
3'-exon    18        14         7 (12 by Fgenesh+)
Run time: Fgenesh+ is 50-1000 times faster than GeneWise
21
Automated Gene Calling at Center for Genome
Research MIT
Sequencing: 2003/2004, 6 new yeast genomes; 2004/2005, 20 new yeast genomes

Gene structures are predicted using a combination of FGENESH, FGENESH+, and GENEWISE (Sanger Institute).
If the protein used in the previous [step] had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call, if valid, was favored over the FGENESH call, and was used as the EVIDENCE_GENE.
If this protein had >80% but less than 90% amino acid identity to the translated genome (cumulative across sub-alignments), then the FGENESH call, if valid, was favored over the GENEWISE call, and was used as the EVIDENCE_GENE.
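The selection rule above can be written as a small decision function. This is only a sketch of the rule as stated on the slide. Falling back to the other program when the favored call is invalid, treating exactly 90% as the lower band, and returning `None` at or below 80% are assumptions, since the slide does not cover those cases. The function name is hypothetical.

```python
from typing import Optional


def choose_evidence_gene(identity_pct: float,
                         genewise_valid: bool,
                         fgenesh_valid: bool) -> Optional[str]:
    """Pick which call becomes the EVIDENCE_GENE for one protein hit."""
    if identity_pct > 90:
        order = [("GENEWISE", genewise_valid), ("FGENESH", fgenesh_valid)]
    elif identity_pct > 80:
        order = [("FGENESH", fgenesh_valid), ("GENEWISE", genewise_valid)]
    else:
        return None  # not covered by the slide (assumption)
    for name, valid in order:
        if valid:
            return name
    return None


if __name__ == "__main__":
    print(choose_evidence_gene(95.0, True, True))
    print(choose_evidence_gene(85.0, True, True))
    print(choose_evidence_gene(95.0, False, True))
```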

26
Examples of usage of the Fgenesh suite in genome
annotations

Grimwood J, Gordon LA, Olsen A, ..., Salamov A., Solovyev V., ..., Lucas S. (2004) The DNA sequence and biology of human chromosome 19. Nature 428(6982), 529-535. Used Fgenesh, Fgenesh+ and Est_map to annotate genes in human chromosome 19.
Heilig et al. (2003) The DNA sequence and analysis of human chromosome 14. Nature 421, 601-607. FGENESH used for human chromosome 14 annotation.
Hillier et al. (2003) The DNA sequence of human chromosome 7. Nature 424, 157-164. Extensive use of FGENESH-2 for human chromosome 7 annotation.
Feng et al. (2002) Sequence and analysis of rice chromosome 4. Nature 420, 316-320. FGENESH used for annotation of rice chromosome 4.
Galagan et al. (2003) The genome sequence of the filamentous fungus Neurospora crassa. Nature 422, 859-868. Neurospora genome annotation based on FGENESH and FGENESH+.
Lander et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921. The original paper on sequencing the human genome by the public consortium also reports use of the FGENESH genefinder for genome annotation.
Deloukas et al. (2001) The DNA sequence and comparative analysis of human chromosome 20. Nature 414, 865-871. Use of FGENESH for annotation of human chromosome 20.
Yu et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79-92. The rice genome sequencing and annotation project used FGENESH as the primary source of gene predictions.
Holt et al. (2002) The Genome Sequence of the Malaria Mosquito Anopheles gambiae. Science 298, 129-149. Use of FGENESH for annotation of the Anopheles genome.

27
Canonical and Non-canonical splice sites
SpliceDB (Burset, Seledtsov, Solovyev, NAR
1999,2000)
GT-AG 99.24%, GC-AG 0.69%, AT-AC 0.05%, other sites 0.02%

a) GT-AG group (canonical splice sites): 22199 examples
M70 A60 G80 GT R95 A71 G81 T46
Y73 Y75 Y78 Y79 Y80 Y79 Y78 Y81 Y86 Y86 N C71 AG G52

b) GC-AG group: 126 examples
M83 A89 G98 GC A87 A84 G97 T71
c) AT-AC group: 8 annotated examples, 2 examples recovered from annotation errors
S90 AT A100 T100 C100 C100 T100 T90 T70
T70 G50 C70 N C60 AC A60 T60
Gene prediction is usually done with only standard splice sites
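The grouping used above depends only on the first and last two bases of each intron, so it can be reproduced with a few lines of code. The sketch below classifies intron sequences into the GT-AG, GC-AG, AT-AC and "other" groups and tallies percentages. The function names and the toy introns are invented for the example and are not SpliceDB data.

```python
from collections import Counter

KNOWN_GROUPS = ("GT-AG", "GC-AG", "AT-AC")


def classify_intron(intron_seq: str) -> str:
    """Classify an intron by its terminal dinucleotides."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have both splice sites")
    pair = f"{seq[:2]}-{seq[-2:]}"
    return pair if pair in KNOWN_GROUPS else "other"


def group_percentages(introns):
    """Return the percentage of introns in each splice-site group."""
    counts = Counter(classify_intron(s) for s in introns)
    total = sum(counts.values())
    return {group: 100.0 * n / total for group, n in counts.items()}


if __name__ == "__main__":
    toy_introns = ["GTAAGTTTTTCAG", "gtgagtccctag", "GCAAGTTTTCAG", "ATATCCTTTTAC"]
    print(group_percentages(toy_introns))
```

A gene finder restricted to standard sites would reject every intron whose group is not "GT-AG".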
28
Additional sources of genes

Identified with the help of synteny data
Non-canonical splice sites
Alternatively spliced
Alternative promoters
Alternative poly-A

Additional studies of the above topics will
update the current gene collections
30
Exon-based synteny

Run Gene-finding annotation pipeline for each
genome
Select chains of similar exons between 2 genomes
comparing coding exons by Blast

95% in agreement with filtered genome alignments:
Brudno et al. (2004) Automated Whole-Genome Multiple Alignment of Rat, Mouse, and Human. Genome Research 14(4), 685-692.
32
Pseudogene finding using Prot_map
Each group below shows genomic coordinates, the genomic sequence (translated regions in upper case), the protein sequence, and protein coordinates.

54408308 54408560 54408568 54408581 54408611 54408641
nnnnnnnn(..)nnnnnnnnnnnnnagKEFDFESANAQFNKEEMGREFHNKLKLKEDKL
--------(..)--------------- KDFDFESANAQFNKEEIDREFHNKLKLKEDKL
134 134 134 136 146 156

54408671 54408701 54408731 54408761 54408791 54408821
EKEEKPVNGEDKGDSGVDTQNSEGHADEEDALGPNCFYDQTKSSFDNISGDDNRERRPTW
EKQEKPVNGEDKGDSGVDTQNSEGNADEEDPLGPNCYYDKTKSFFDNISCDDNRERRPTW
166 176 186 196 206 216

54408851 54408881 54408911 54408939 54408967 54408976
AEGRRLNAETFGIPLCPNRGHGGYRGRGaGLGFHGGRGRggtggcagaagtggta(..)
AEERRLNAETFGIPLRPNRGRGGYRGRG-GLGFRGGRGR----------------(..)
226 236 246 255 264 264
33
Pseudogene finder

Generation of pseudogene candidates:
Run a script that finds genes having almost identical encoded proteins (or parts of them) but fewer introns (or no introns).
Run Prot_map mapping of human (mammalian) proteins and select the damaged ones.
Select pseudogenes using additional features such as a poly-A tail and the ks/kn ratio.

35
Development of a eukaryotic promoter recognizer in the
group of TSS programs
36
Results of promoter search on genes with known
mRNAs by different promoter-finding programs.
Reproduced from Liu and States (2002) Genome
Research 12, 462-469.
38
Accuracy of prediction by TSSP on plant genomic
sequences
Selected known genomic regions upstream of CDS
TATA promoters (40): true positives 92, total number of false positives 22 (1 per 3648 bp)
TATA-less promoters (25): true positives 95, total number of false positives 15 (1 per 3300 bp)
For each class of promoters (TATA and TATA-less), only the one predicted TSS with the highest score in an interval of 300 bp was taken during the search.
42
PromH with orthologous sequences
43
Fgenesb_annotator - Bacterial Gene/Operon
Prediction and Annotation Pipeline
FGENESB is a new complex package for annotation of bacterial genomes.
Its gene prediction algorithm is based on Markov chain models of coding regions and of translation and termination sites.
Operon models are based on distances between ORFs, frequencies of different genes neighboring each other in known bacterial genomes, and predicted promoters and terminators.
The parameters of gene prediction are self-learned, so the only input necessary for annotation of a new genome is its sequence.
44
Fgenesb accuracy on difficult sets
45
rRNA and tRNA annotation
STEP 1. Finds all potential ribosomal RNA genes using BLAST against bacterial and/or archaeal rRNA databases, and masks the detected rRNA genes.
STEP 2. Predicts tRNA genes using the tRNAscan-SE program. Inside bactg_ann.pl: runs tRNAscan-SE and masks the detected tRNA genes.
46
Genes and Operon identification
STEP 3. Makes initial predictions of long, slightly overlapping ORFs that are used as a starting point for calculating prediction parameters. Iterates until the parameters stabilize. Generates parameters such as 5th-order in-frame Markov chains for coding regions, 2nd-order Markov models for the region around the start codon and the upstream RBS site, the stop codon, and probability distributions of ORF lengths. Predicts protein-coding genes.
STEP 4. Predicts operons based only on distances between predicted genes.
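A distance-only operon call, as in STEP 4, can be sketched as grouping neighboring genes whose intergenic gap is small. The code below is an illustration, not FGENESB's model: the 100 bp threshold, the requirement that grouped genes lie on the same strand, and the function name are assumptions.

```python
def group_operons(genes, max_gap=100):
    """Group genes into candidate operons by intergenic distance.

    genes: list of (start, end, strand) tuples with 1-based coordinates.
    max_gap: hypothetical maximum gap (bp) between genes of one operon.
    Same-strand grouping is an added assumption, not stated on the slide.
    """
    operons = []
    for gene in sorted(genes):
        start, _end, strand = gene
        if operons:
            prev_start, prev_end, prev_strand = operons[-1][-1]
            gap = start - prev_end - 1  # negative when genes overlap
            if prev_strand == strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


if __name__ == "__main__":
    toy_genes = [(100, 400, "+"), (420, 900, "+"),
                 (1500, 2000, "+"), (2010, 2500, "-")]
    for i, operon in enumerate(group_operons(toy_genes), start=1):
        print(i, operon)
```

Later pipeline steps would then revise such groups using neighbor-pair conservation and predicted promoters and terminators.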
47
Annotate genes comparing with databases of known
proteins
STEP 5. Runs blastp for predicted proteins against the COG database (cog.pro) and annotates them with COG descriptions.
STEP 6. Runs blastp against NR for proteins having no COG hits and annotates them with NR descriptions.
48
Promoters and Terminators prediction and
improvement of operons assignment
STEP 7. Uses information about conservation of neighbor gene pairs in known genomes to improve operon prediction.
STEP 8. Predicts potential promoters (tssb) and terminators (bterm) in the corresponding 5'-upstream and 3'-downstream regions of predicted genes.
Tssb: bacterial promoter prediction (sigma70), using a discriminant function with characteristics of sequence features of promoters (such as conserved motifs, binding sites, etc.).
Bterm: prediction of rho-independent terminators as hairpins, with energy scoring based on a discriminant function of hairpin elements.
STEP 9. Refines operon predictions using predicted promoters and terminators as additional evidence.
49
Fgenesb_annotator output
1 1 Op 1 21/0.000 CDS 407 - 1747 1311 COG0593 ATPase involved in DNA
Term 1786 - 1823 3.2
Prom 1847 - 1906 10.5
2 1 Op 2 3/0.019 CDS 1926 - 3065 1237 COG0592 DNA polymerase
Term 3074 - 3122 9.1
Prom 3105 - 3164 4.0
3 2 Op 1 4/0.002 CDS 3193 - 3405 278 COG2501 Uncharacterized ACR
4 2 Op 2 4/0.002 CDS 3418 - 4545 899 COG1195 Recombinational DNA
2 Op 3 16/0.000 CDS 4578 - 6506 2148 COG0187 DNA gyrase (topoisomerase II) B subunit
Term 6516 - 6551 4.7
Prom 6512 - 6571 2.3
6 2 Op 4 . CDS 6595 - 9066 2957 COG0188 DNA gyrase (topoisomerase II) A subunit
Term 9067 - 9098 3.4
SSU_RRNA 9308 - 10861 100.0 AY138279 D1..1554 16S ribosomal RNA Bacillus cereus
TRNA 10992 - 11068 101.2 Ile GAT 0 0
TRNA 11077 - 11152 93.9 Ala TGC 0 0
LSU_RRNA 11233 - 14154 99.0 AF267882 D1..2922 23S ribosomal RNA Bacillus
7 3 Op 1 . - CDS 14175 - 14363 158
5S_RRNA 14205 - 14315 97.0 AE017026 D165635..165750 5S ribosomal RNA Bacillus
8 3 Op 2 . - CDS 14353 - 15249 351 Similar_to_GB
9 3 Op 3 . - CDS 15170 - 15352 99
- Prom 15373 - 15432 6.9
54
Comparison of 2 bacterial genomes
GenomMatch aligns 2 bacterial genomes (2 MB x 2 MB) in
30 sec
56
Figure1
58
Nature (2004) 428 (6978), p. 37-43
59
Annotation of new bacteria
New drugs
Annotation of bacterial communities
DNA from specific sources (not growing in labs)
Oceans / acid mines / agriculture (with a mix of 100s of species)
New ferments