The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster

Reese et al., Tutorial #3, ISMB 99.

Learn more at: https://www.fruitfly.org

1
The challenge of annotating a complete eukaryotic
genome: A case study in Drosophila melanogaster

Martin G. Reese (mgreese_at_lbl.gov)
Nomi L. Harris (nlharris_at_lbl.gov)
George Hartzell (hartzell_at_cs.berkeley.edu)
Suzanna E. Lewis (suzi_at_fruitfly.berkeley.edu)
Drosophila Genome Center
Department of Molecular and Cell Biology
539 Life Sciences Addition
University of California, Berkeley

2
Abstract
Many of the technical issues involved in
sequencing complete genomes are essentially
solved. Technologies already exist that provide
sufficient solutions for ascertaining sequencing
error rates and for assembling sequence data.
Currently, however, standards or rules for the
annotation process are still an outstanding
problem. How shall the genomes be annotated,
what shall be annotated, which computational
tools are most effective, how reliable are these
annotations, how organism-specific do the tools
have to be and ultimately how should the
computational results be presented to the
community? All these questions are unsolved. This
tutorial will give an overview and assessment of
the current state of annotation based upon
experiences gained at the Drosophila melanogaster
genome project. In the tutorial we will do three
things. First, we will break down the annotation
process and discuss the various aspects of the
problem. This will serve to clarify the term
"annotation", which is often used to collectively
describe a process that has a number of discrete
steps. Second, with the participation of
computational biologists from the community we
will compare existing tools for sequence
annotation. We will do this by providing a 3
megabase sequence that has already been
well-characterized at our center as a testbed for
evaluating other feature-finding algorithms. This
is similar to what has been done at the CASP
(critical assessment of techniques for protein
structure prediction) conferences
(http://predictioncenter.llnl.gov) for protein
structure prediction. Third, we will discuss
which annotation problems are essentially solved
and which problems remain.

3
Tutorial goals

Review the algorithms currently used in
annotation
Assess existing methods under field conditions
Identify open issues in annotation

4
Tutorial organization

Definitions
Annotation
Biological issues
Engineering issues
Application of tools within an existing
annotation system
Break (20 minutes)
Review of existing tools
Our annotation experiment
Conclusions and outstanding issues

5
What is a gene?

Definition: An inheritable trait associated with
a region of DNA that codes for a polypeptide
chain or specifies an RNA molecule, which in turn
has an influence on some characteristic
phenotype of the organism.

6
What are annotations?

Definition: Features on the genome derived
through the transformation of raw genomic
sequences into information by integrating
computational tools, auxiliary biological data,
and biological knowledge.

7
How does an annotation differ from a gene?

Many annotations are the same as genes
The annotation describes an inheritable trait
associated with a region of DNA.
But an annotation may not always correspond in
this way, e.g. an STS or a sequence overlap
The region of genomic DNA or RNA is not translated
or transcribed

8
Transcription and translation
9
Schematic gene structure
10
Sequence feature types

Transcribed region
mRNA, tRNA, snoRNA, snRNA, rRNA
Structural region
Exon, intron, 5' UTR, 3' UTR, ORF, cleavage
product
Mutations: insertion, deletion, substitution,
inversion, translocation
Functional or signal region
Promoter, enhancer, DNA/RNA binding site, splice
site signal, poly-adenylation signal
Protein processing: glycosylation, methylation,
phosphorylation site
Similarity
Homolog, paralog, genomic overlap (syntenic
region)
Other feature types
Transposable element, repetitive element
Pseudogene
STS, insertion site

11
DNA transcription unit features

Promoter elements
Core promoter elements
TATA box
Initiator (Inr)
Downstream promoter element (DPE)
Transcription factor (TF) binding sites
CAAT boxes
GC boxes
SP-1 sites
GAGA boxes
Enhancer site(s)

12
mRNA features

Exon
Initial, internal, terminal
Codon usage, preference
Control elements (e.g. splice enhancers)
Intron
5' splice site (GT), branchpoint (lariat), 3'
splice site (AG)
Repeat elements
Start codon (translation start site)
Kozak rule
UTR (untranslated regions)
5' UTR
Translation regulatory elements
RNA binding sites
Initial, internal, terminal
Control elements (e.g. splice enhancers)
3' UTR
RNA binding sites (cis-acting elements)
Stop codon
Poly-adenylation signal and site

14
Definitions for data modeling

Feature: An interval or an ordered set of
intervals on a sequence that describes some
biological attribute and is justified by
evidence.
Sequence: A linear molecule of DNA, RNA or amino
acids.
Evidence: A computational or experimental result
coming out of an analysis of a sequence.
Annotation: A set of features.
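
The four definitions above map fairly directly onto a small data model. The following is only a sketch: the field names, the 0-based half-open coordinate convention, and the validation rules are assumptions made for illustration, not the BDGP schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Sequence:
    """A linear molecule of DNA, RNA or amino acids."""
    name: str
    residues: str


@dataclass(frozen=True)
class Evidence:
    """A computational or experimental result from analysing a sequence."""
    program: str   # hypothetical field, e.g. "sim4" or "BLASTX"
    score: float


@dataclass
class Feature:
    """An interval, or ordered set of intervals, justified by evidence."""
    kind: str                           # e.g. "exon", "mRNA", "STS"
    sequence: Sequence
    intervals: List[Tuple[int, int]]    # assumed 0-based, half-open
    evidence: List[Evidence] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.intervals:
            raise ValueError("a feature needs at least one interval")
        for start, end in self.intervals:
            if not 0 <= start < end <= len(self.sequence.residues):
                raise ValueError(f"interval ({start}, {end}) is off the sequence")
        # "Ordered set of intervals": each one must end before the next begins.
        for (_, prev_end), (next_start, _) in zip(self.intervals, self.intervals[1:]):
            if prev_end > next_start:
                raise ValueError("intervals must be ordered and non-overlapping")

    def span(self) -> Tuple[int, int]:
        """Outer bounds of the feature on its sequence."""
        return self.intervals[0][0], self.intervals[-1][1]

    def length(self) -> int:
        """Total number of residues covered by the intervals."""
        return sum(end - start for start, end in self.intervals)


@dataclass
class Annotation:
    """A set of features."""
    features: List[Feature] = field(default_factory=list)
```

Modeling a feature as an ordered list of intervals lets a spliced transcript and a single-interval STS share one type.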

15
Annotation
Annotated genome
Depth of knowledge
Breadth of knowledge
16
Annotation process overview
Methods
Data
Genome Sequence
Auxiliary Data
Computational Tools
Database Resources
Annotation Systems
Understanding of a Genome
17
Types of sequence data

Chromosomal sequence
Euchromatic
Heterochromatic
mRNA sequences
Full length cDNA
5' EST
3' EST
Protein sequences
Insertion site flanking sequences

18
Auxiliary data

Maps
Genetic, physical, radiation hybrid map (RH),
deletion, cytogenetic
Expression data
Tissue, stage
Phenotypes
Lethality, sterility

19
Computational annotation tools

Gene finding
Repeat finding
EST/cDNA alignment
Homology searching
BLAST, FASTA, HMM-based methods, etc.
Protein family searching
PFAM, Prosite, etc.

20
Database resources

Curated sequence feature data sets
Repeat elements
Transposons
Non-redundant mRNA
STSs and other sequence markers
Genome sequence from related species
D. melanogaster vs. D. virilis, D. hydei
Genome sequence from more distant species
Protein sequences from distant species

21
Biological issues in annotation

Common
Genes within genes
Alternative splicing
Alternative poly-adenylation sites
Rare
Translational frame shifting
mRNA editing
Eukaryotic operons
Alternative initiation

22
Engineering issues in annotation

What sequence to start with?
Because features are intervals on a sequence,
problems can be caused by gaps, frameshifts, and
other changes to the sequence. How do you track
these changes over time and model features that
span gaps?
When to annotate?
Feature identification can aid in sequencing. It
may be advisable to carry out sequencing and
annotation in parallel thus enabling them to
complement one another.
What analyses need to be run and how?
What dependencies are there between various
analysis programs?
What parameter settings to use?

23
Engineering issues in annotation

What public sequence data sets are needed?
What are the mechanics of obtaining public
sequence databases?
Are curated data sets available or do you need to
set up a means of maintaining your own (for
repeats, insertions, organism of interest)?
How do you achieve computational throughput?
Workstation farm, or simply a big, powerful box?
Job flow control
What do you do with the results?
Homogenize results into single format?
Filter results for significance and redundancy

24
Engineering issues in annotation

Interpreting the results
Is human curation needed?
How can you achieve consistency between curators?
How do you design the user interface so that it
is simple enough to get the task completed
speedily but complex enough to deal with biology?
How do you capture curations?
How are annotation translations to be described?
EC terminology
ProSite families
Pfam domains
Is function distinguishable from process?

25
Engineering issues in annotation

How do you manage data?
What is the appropriate database schema design?
How is the database to be kept up to date? Will
it be directly from programs running user
interfaces and analyses or via a middleware
layer?
Is a flat file format needed and what should it
be?
What query and retrieval support is needed?
How do you distribute data?
For bulk downloads what is the format of the
data?
What information is best summarized in tables?
What information requires an integrated graphical
view?

26
Engineering issues in annotation

How do you update the annotations?
How frequently are they re-evaluated?
How can re-evaluation be minimized (only subsets
of the databanks, only modified sequences)?
How can differences between old and new
computational results be detected?
Changes in computational results may need to
trigger changes in curated annotations
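
One way to approach the last two points is to treat each computational result as a comparable record and diff successive runs. This is a sketch under assumed conventions: results are tuples of (program, subject, start, end), curated annotations are (name, start, end), and coordinates are half-open. None of these names come from the BDGP pipeline.

```python
def diff_results(old, new):
    """Return (added, removed) results between two analysis runs.

    Each result is a hashable tuple: (program, subject, start, end).
    """
    old_set, new_set = set(old), set(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)


def annotations_to_review(curated, changed_results):
    """Names of curated annotations that overlap any changed result.

    curated: list of (name, start, end) tuples.
    """
    flagged = []
    for name, a_start, a_end in curated:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(r[2] < a_end and a_start < r[3] for r in changed_results):
            flagged.append(name)
    return flagged
```

Only the annotations returned by `annotations_to_review` would then be queued for a curator, which keeps re-evaluation limited to regions where the evidence actually changed.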

27
Drosophila melanogaster

Drosophila is the most important model organism
Drosophila genome
4 chromosomes
180 Mb total sequence
140 Mb euchromatic sequence
12,000-14,000 genes

Source: G.M. Rubin
28
Drosophila Genome Project

Laboratories working on Drosophila sequencing
BDGP (Berkeley Drosophila Genome Project)
EDGP (European Drosophila Genome Project)
Celera Genomics Inc.
Complete D. melanogaster sequence will be
finished by the end of 1999
Comprehensive database - FlyBase

29
Goals of the Drosophila Genome Project

Complete genome sequence
Structure of all transcripts
Expression pattern of all genes
Phenotype resulting from mutation of all ORFs
And more...

30
Sequencing at the BDGP

Genomic sequence
P1 and BAC clones
24Mb of completed sequence (as of July 22, 1999)
18Mb unfinished sequence in process
Complete tiling path in BACs
1.5x-path draft sequencing
ESTs and cDNAs
80,942 ESTs finished (as of March 19, 1999)
Over 800 full-length cDNAs

31
The BDGP sequence annotation process
32
What sequence to start with?

Unit of sequencing at the BDGP
Completed high-quality clone sequences
Reassembling the genomic sequence
Need to place clones in correct genomic positions
Need to integrate genes that span multiple clones
Solved by using genomic overlaps to reconstitute
full genomic sequence

33
Which analyses need to be run?

Similarity searches
BLAST (Altschul et al., 1990)
BLASTN (nucleotide databases)
BLASTX (amino acid databases)
TBLASTX (amino acid databases, six-frame
translation)
sim4 (Miller et al., 1998)
Sequence alignment program for finding
near-perfect matches between nucleotide sequences
containing introns
Gene predictors
Genefinder (Green, unpublished)
GenScan (Burge and Karlin, 1997)
Genie (Reese et al., 1997)
Other analyses
tRNAscan-SE (Lowe and Eddy, 1996)
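
The six-frame translation mentioned for TBLASTX can be illustrated with a few lines of code. This is a minimal sketch using the standard genetic code; it is not how BLAST implements translation, and the frame numbering (+1 to +3, -1 to -3) is simply a common convention assumed here.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Return a dict mapping frame number to its protein translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```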

34
Which analyses need to be run and how?

mRNAs
ORFFinder (Frise, unpublished)
Protein translations
HMMPFAM 2.1 (Eddy 1998) against PFAM (v 2.1.1
Sonnhammer et al. 1997, Bateman et al. 1999)
Ppsearch (Fuchs 1994) against ProSite (release
15.0) filtered with EMOTIF (Nevill-Manning et
al. 1998)
Psort II (Horton and Nakai 1997)
ClustalW (Higgins et al. 1996)

35
What public sequence data sets are needed?

Automating updates of public databases
Genbank, SwissProt, trEMBL, BLOCKS, dbEST, EDGP
Curated data sets
D. melanogaster genes (FlyBase)
Transposable elements (EDGP)
Repeat elements (EDGP)
STSs (BDGP)

36
Which analyses need to be run and how?
37
How do you achieve computational throughput?

BDGP computing power
Sun Ultra 450 (3 machines, 4 processors each)
Sun Enterprise (1 machine, 8 processors)
Used these directly, without any system for
distributed computing.
Job flow control: the Genomic Daemon
Automatic batch analysis of genomic clones
Berkeley Fly Database is used for queuing system
and storage of results
Many clones can be analyzed simultaneously
Results are processed and saved in XML format for
interactive browsing

38
What do you do with the results?

Berkeley Output Parser (BOP)
Input to BOP
Genomic sequence
Results of computational analyses
Filtering preferences
Parses results from BLAST, sim4, GeneFinder,
GenScan, and tRNAscan-SE analyses
Filters BLAST and sim4 results
Eliminates redundant or insignificant hits
Merges hits that represent single region of
homology
Homogenizes results into single format
Output sequence and filtered results in XML
format
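
The filtering and merging steps listed for BOP can be sketched as below. This is an illustration only: the `Hit` record, the score cutoff, and the rule for merging (same subject, overlapping or nearly adjacent on the genomic sequence) are assumptions, not BOP's actual criteria.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str   # database entry that was matched
    start: int     # genomic start, 0-based inclusive
    end: int       # genomic end, exclusive
    score: float


def filter_and_merge(hits: List[Hit], min_score: float = 50.0, max_gap: int = 0) -> List[Hit]:
    """Drop weak hits, then merge hits to the same subject that overlap
    or lie within max_gap bases of each other."""
    kept = sorted(
        (h for h in hits if h.score >= min_score),
        key=lambda h: (h.subject, h.start, h.end),
    )
    merged: List[Hit] = []
    for h in kept:
        last = merged[-1] if merged else None
        if last and last.subject == h.subject and h.start <= last.end + max_gap:
            # Extend the previous region and keep the better score.
            merged[-1] = Hit(last.subject, last.start,
                             max(last.end, h.end), max(last.score, h.score))
        else:
            merged.append(h)
    return merged
```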

39
Is human curation needed?

Not for everything
Some features are obvious and can be identified
computationally
Known D. melanogaster genes are detected
automatically by GeneSkimmer
Repetitive elements
But still for many things
Annotating complete gene structure is still hard
We use CloneCurator (BDGP's Java graphical
editor) for curation

40
Gene Skimmer

Quick way of identifying genes in new sequence
before curation
Start with XML output from BOP
Look for sim4 hits with known Drosophila genes
Find gene hits with sequence identity > 98%,
coverage > 30%
Verify that hits represent real genes
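
The threshold step can be sketched as a simple filter. The sketch reads the two cutoffs as percentages (identity above 98, coverage above 30) and defines coverage as the aligned fraction of the known gene's length. Both readings are assumptions, and the dictionary keys are hypothetical rather than the BOP XML schema.

```python
def skim_genes(hits, min_identity=98.0, min_coverage=30.0):
    """Return names of known genes whose sim4 hits pass both cutoffs.

    hits: iterable of dicts with keys
      "gene"        name of the known gene
      "identity"    percent identity of the alignment
      "aligned_len" number of gene bases aligned to the clone
      "gene_len"    full length of the known gene sequence
    """
    found = []
    for h in hits:
        coverage = 100.0 * h["aligned_len"] / h["gene_len"]
        if h["identity"] > min_identity and coverage > min_coverage:
            found.append(h["gene"])
    return found
```

A low coverage cutoff makes sense for clone-by-clone analysis, since a gene may extend beyond the end of the clone being examined.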

41
Gene Skimmer
URL: http://www.fruitfly.org/sequence/genomic-clones.html
42
CloneCurator

Displays computational results and annotations on
a genomic clone
Interactive browsing
Zoom/scroll
Change cutoffs for display of results
Analyze GC content, restriction sites, etc.
Interactive annotation editing
Expert endorses selected results
Presents annotations to community via Web site
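
The GC content and restriction site analyses mentioned above are simple to compute. The functions below are a generic sketch, not CloneCurator code; the window size, step, and the choice of the EcoRI site (GAATTC) as the default are illustrative assumptions.

```python
def gc_content(dna):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


def windowed_gc(dna, window, step):
    """GC content in sliding windows, as (window_start, fraction) pairs."""
    return [
        (i, gc_content(dna[i:i + window]))
        for i in range(0, len(dna) - window + 1, step)
    ]


def find_sites(dna, site="GAATTC"):
    """0-based start positions of every occurrence of a recognition site."""
    dna = dna.upper()
    positions = []
    i = dna.find(site)
    while i != -1:
        positions.append(i)
        i = dna.find(site, i + 1)   # step by one so overlapping sites are found
    return positions
```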

44
How do we annotate gene/protein function?

Gene Ontology Project
Controlled hierarchical vocabulary for
multiple-genome annotations and comparisons
Standardized vocabulary facilitates collaboration
Good data modeling allows better database
querying
Ontology browser provides interactive search of
hierarchical terms
GO project (http://www.ebi.ac.uk/ashburn/GO)

45
Ontology browser
47
Ontology browser searching for terms
48
How do you distribute the data?

Bulk downloads
FASTA at http://www.fruitfly.org/sequence/download.html
Curated data sets
Tabular data
At http://www.fruitfly.org/sequence/
Sequenced genomic clones
Clone contigs sorted by genomic location
Clone contigs sorted by size
Ribbon provides integrated graphical view of
annotations on physical contigs

49
Ribbon

Human curator annotates individual clones
(100Kb)
Clones are assembled into physical contigs
(regions of physical map)
Clone annotations are merged and renumbered for
display on whole physical contigs
Ribbon is our Java display tool for displaying
curated annotations on physical contigs
Will soon be available on Web
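
The "merged and renumbered" step amounts to a coordinate transformation from clone space to contig space. A minimal sketch follows, assuming half-open coordinates, a known offset and orientation for each clone, and that a feature annotated identically in two overlapping clones should appear once. The dictionary layout is hypothetical and is not Ribbon's data format.

```python
def to_contig_coords(start, end, clone_offset, clone_length, reverse=False):
    """Map a half-open clone interval onto the contig."""
    if reverse:
        # A clone placed in reverse orientation has its coordinates mirrored.
        start, end = clone_length - end, clone_length - start
    return clone_offset + start, clone_offset + end


def merge_clone_annotations(clones):
    """Combine per-clone features into one sorted contig-level list.

    clones: list of dicts with keys "offset", "length", "reverse" and
    "features", where features is a list of (name, start, end) tuples.
    """
    merged = set()
    for clone in clones:
        for name, start, end in clone["features"]:
            merged.add((name, *to_contig_coords(
                start, end, clone["offset"], clone["length"], clone["reverse"])))
    return sorted(merged, key=lambda f: (f[1], f[2], f[0]))
```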

50
Ribbon
51
How do you manage the data?

Using Informix as our database server
Updated via Perl dbi.pm module
Development underway in
Schema revisions
GAME DTD (Genome Annotation Markup Entities)
Perl module for annotation objects
http://www.bioxml.org/ (Ewan Birney)

52
How do you maintain annotations?

Open questions
How frequently are annotations re-evaluated?
How can re-evaluation be minimized (only subsets
of the databanks, only modified sequences)?
How can differences between old and new
computational results be detected?
Changes in computational results may need to
trigger changes in curated annotations

53
Integrated annotation systems

ACeDB
Genotator
Magpie
GAIA
TIGR

54
Integrated annotation systems: ACeDB

Developed for analysis of the C. elegans genome
Sophisticated database designed for storing
annotations and related information
New Java and Web-based versions available
Written by Jean Thierry-Mieg and Richard Durbin
http://www.sanger.ac.uk/Software/Acedb/

55
ACeDB
56
Genotator

Back end automates sequence analysis; browser
provides interactive viewing and editing of
annotations
Nomi Harris (1997), Genome Research 7(7),
754-762.
http://www-hgc.lbl.gov/inf/annotation.html

57
Magpie

Expert system based (PROLOG)
Data collection daemon
Data analysis and report daemon
Intelligent integration of various individual
feature prediction systems
Allows human interactions
Gaasterland and Sensen (1996), TIG, 12, 76-78.
http://genomes.rockefeller.edu/magpie/magpie.html

58
GAIA

Web-based system
Results displayed as Java applets
Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J.
Crabtree, D.B. Searls, and G.C. Overton (1998),
Genome Research.
http://daphne.humgen.upenn.edu:1024/gaia/

59
TIGR Human Gene Index

Gene Indices for various organisms
Databases for transcribed genes linked into
external/internal genomic databases
Internal backend analysis software
http://www.tigr.org/tdb/tdb.html

60
Computational analysis tools

Gene finding
Repeat finding
EST/cDNA alignment
Homology searching
BLAST, FASTA, HMM-based methods, etc.
Protein family searching
PFAM, Prosite, etc.

61
Gene finding: Prokaryotes vs. Eukaryotes

Prokaryotes
Contiguous open reading frames (ORF)
Short intergenic sequences
Good method: detecting large ORFs
Complications
Partial sequences
Sequencing errors
Start codon prediction
Overlapping genes on both strands
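
The "detect large ORFs" approach for prokaryotes can be sketched as a scan of all six frames for ATG-to-stop stretches above a length cutoff. This is a deliberately naive illustration: it takes the first ATG in each open stretch as the start (which is exactly the start codon prediction complication listed above), ignores alternative start codons, and the default cutoff is an arbitrary assumption.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=100):
    """Return (strand, start, end) for each ORF of at least min_codons codons.

    Coordinates are 0-based, half-open, on the forward strand; the stop
    codon is included in both the interval and the codon count.
    """
    dna = dna.upper()
    n = len(dna)
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    end = i + 3
                    if (end - start) // 3 >= min_codons:
                        if strand == "+":
                            orfs.append(("+", start, end))
                        else:
                            # Convert reverse-strand positions to forward coordinates.
                            orfs.append(("-", n - end, n - start))
                    start = None
    return orfs
```

An ORF that runs off the end of a partial sequence is never closed by a stop codon here, so it is silently dropped.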

62
Gene finding: Prokaryotes vs. Eukaryotes

Eukaryotes
Complex gene structures (exon/introns)
D. melanogaster has an average of 4 introns/gene
Very long genes (D. melanogaster X gene 160 kb)
Very long introns
Many introns
Nested, overlapping, and alternatively spliced
genes
5' UTRs with non-coding exons
Long 3' UTRs
Complex transcription machinery
ORF-finding alone is not adequate

63
Integrated gene finding

Assumptions
Signals and content method sensors alone are not
sufficient for predicting gene structure
Gene structure is hierarchical
Each component (exon, intron, splice site, etc.)
can be modeled independently
The approach
Generate a list of candidates for each component
(with scores)
Assemble the components into a gene model

64
Integrated gene finding: Dynamic programming

Determines the best combination of components
Two-part problem
Develop an optimal scoring function
Use dynamic programming to find an optimal
alignment through scoring matrix
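
As a toy illustration of the dynamic programming step, the sketch below picks the highest-scoring chain of non-overlapping candidate exons. It is a strong simplification of what real gene finders do: it assumes additive exon scores and ignores reading frame, splice-site compatibility, and intron scores. The candidate format (start, end, score) with half-open coordinates is an assumption.

```python
from bisect import bisect_right


def best_gene_model(candidates):
    """Return (total_score, chosen_exons) for the best non-overlapping chain.

    candidates: list of (start, end, score) tuples, half-open coordinates.
    """
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)      # best[i]: optimum using the first i exons
    choice = [None] * (len(exons) + 1)   # where to continue if exon i is taken
    for i, (start, _end, score) in enumerate(exons, start=1):
        # p = how many of the earlier exons end at or before this one starts.
        p = bisect_right(ends, start, 0, i - 1)
        take = best[p] + score
        if take > best[i - 1]:
            best[i], choice[i] = take, p
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen exons.
    chosen = []
    i = len(exons)
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            chosen.append(exons[i - 1])
            i = choice[i]
    return best[-1], chosen[::-1]
```

The table is filled in a single pass after sorting, so the cost is dominated by the sort and the binary searches rather than by enumerating every combination of components.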

65
Integrated gene finding: Dynamic programming
66
Integrated gene finding: Linear and Quadratic
Discriminant Analysis (LDA/QDA)

LDA
Deterministic calculation of thresholds
n-class discrimination
Example
HSPL, Solovyev et al. (1997), ISMB, 5, 294-302.
QDA
Can represent a great improvement over LDA
Example
MZEF, Michael Zhang (1997), PNAS, 94, 565-568.

67
Integrated gene finding: Feed-forward neural
networks

Supervised learning
Training to discriminate between several feature
classes
Computing units
Gradient descent optimization
Multi-layer networks
Limitations
Black-box predictions
Local minima
Example
GRAIL, Uberbacher et al. (1991), PNAS, 88,
11261-11265.

68
Approaches to gene finding: Hidden Markov models

Model
A finite model describing a probability
distribution over all possible sequences of equal
length
Natural scoring function
(Conditional) Maximum likelihood training
Markov
k-order Markov chain: current state depends on
k previous states
The next state in a 1st-order Markov model
depends only on the current state
Hidden
Hidden states generate visible symbols
Assumptions
Independence of states
No long range correlation
Example: HMMgene, A. Krogh (1998), In Guide to
Human Genome Computing, 261-274.
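
The "Markov" part of the definition can be made concrete with a k-order Markov chain over nucleotides, trained by maximum likelihood (counting) and used as a scoring function. Note that this sketch is a plain, visible Markov chain, not a hidden Markov model: there are no hidden states and no Viterbi decoding. The class name and the pseudocount smoothing are assumptions added for illustration.

```python
import math
from collections import defaultdict


class MarkovChain:
    """k-order Markov chain: each symbol depends on the k preceding symbols."""

    def __init__(self, k, alphabet="ACGT", pseudocount=1.0):
        self.k = k
        self.alphabet = alphabet
        self.pseudocount = pseudocount
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, sequences):
        """Maximum likelihood training: count each (context, next symbol) pair."""
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1

    def prob(self, context, symbol):
        """Smoothed estimate of P(symbol | context)."""
        row = self.counts.get(context, {})
        total = sum(row.values()) + self.pseudocount * len(self.alphabet)
        return (row.get(symbol, 0.0) + self.pseudocount) / total

    def log_likelihood(self, seq):
        """Log probability of seq, conditional on its first k symbols."""
        return sum(
            math.log(self.prob(seq[i - self.k:i], seq[i]))
            for i in range(self.k, len(seq))
        )
```

Comparing the log-likelihood of a window under two chains, for example one trained on exons and one on introns, gives the natural scoring function mentioned above.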

69
Approaches to gene finding: Generalized hidden
Markov models

Each HMM state can be a probabilistic sub-model
Complex hierarchical system
Requires care in modeling state overlaps
Example
Genie, Kulp et al. (1996), ISMB, 4, 134-142
GenScan, Burge and Karlin (1997), JMB, 268(1),
78-94

70
Gene finding software

Signal recognition
Promoter prediction
Splice site prediction
Start codon prediction
Poly-adenylation site prediction
Coding potential
Coding exons
Gene structure prediction
Spliced alignment
LDA/QDA
Neural networks
HMMs and GHMMs

71
Promoter recognition

PromoterScan
Identify potential promoter regions
Based on databases of known TF binding sites
TFD (Ghosh (1991), TIBS, 16, 445-447)
TRANSFAC (Heinemeyer et al. (1999), NAR, 27,
318-322)
Prestridge (1995), JMB, 249, 923-932
http://bimas.dcrt.nih.gov/molbio/proscan/
MatInd and MatInspector
Finding consensus matches to known TF binding
sites
Based on TRANSFAC
Heinemeyer et al. (1999), NAR, 27, 318-322
Quandt et al. (1995), NAR, 23, 4878-4884.
http://transfac.gbf.de/TRANSFAC/

72
Promoter recognition (cont.)

TSSG/TSSW
LDA based combination of several features
(TATA-box, Inr signal, upstream regions)
Solovyev et al. (1997), ISMB, 5, 294-302.
http://genomic.sanger.ac.uk/gf/gf.shtml
Transcription Element Search Software
Identify TF binding sites
Based on TRANSFAC
http://agave.humgen.upenn.edu/tess/index.html

73
Promoter recognition (cont.)

CBS Promoter 2.0 Prediction Server
Simulated transcription factors
Principles common to neural networks and genetic
algorithms
Knudsen (1999), Bioinformatics 15(5), 356-361.
http://genome.cbs.dtu.dk/services/promoter/
CorePromoter
Position dependent 5-tuple
QDA
Michael Zhang (1998), Genome Research, 8,
319-326.
http://scislio.cshl.org/genefinder/CPROMOTER/

74
Promoter recognition (cont.)

Neural network promoter prediction (NNPP)
Time-delay neural network
Combining TATA box and initiator
Reese (1999), in preparation.
http://www-hgc.lbl.gov/projects/promoter.html

75
Example: NNPP
76
Promoter recognition (cont.)

Markov chain promoter finder
Competing interpolated Markov chains for
promoters, exons, introns
Promoter model consists of five states
representing the core promoter parts
Ohler, Reese et al. (1999), Bioinformatics 15(5),
362-369.

77
Splice site prediction

Nakata, 1985
Nakata (1985), NAR, 13(14), 5327-5340.
BCM GeneFinder
HSPL - Prediction of splice sites in human DNA
sequences
Triplet frequencies in various functional parts
of splice site regions
Combined with codon statistics
Solovyev et al. (1994), NAR, 22(24), 5156-5163.
http://genomic.sanger.ac.uk/gf/gf.shtml

78
Splice site prediction (cont.)

Neural Network splice site predictor (NNSPLICE)
Multi-layered feed-forward neural network
Modeled after Brunak et al. (1991), JMB, 220,
49-65.
Reese et al. (1997), JCB, 4(3), 311-323.
http://www-hgc.lbl.gov/projects/splice.html
NetGene2
Combination of neural networks and rule-based
system
Splice site signal neural network combined with
coding potential
Hebsgaard et al. (1996), NAR, 24(17), 3439-3452.
Brunak et al. (1991), JMB, 220, 49-65.
http://www.cbs.dtu.dk/services/NetGene2/

79
Splice site prediction (cont.)

SplicePredictor
Logitlinear models for splice site regions
Degree of matching to the splice site consensus
Local compositional contrast
Brendel and Kleffe (1998), NAR, 26(20),
4748-4757.
http://gnomic.stanford.edu/volker/SplicePredictor.html

80
Start codon prediction

NetStart
Trained on cDNA-like sequences
Neural network based
Local start codon information
Global sequence information
Pedersen and Nielsen (1997), ISMB, 5, 226-233.
http://www.cbs.dtu.dk/services/NetStart/

81
Poly-adenylation signal prediction

BCM GeneFinder
POLYAH - Recognition of 3'-end cleavage and
poly-adenylation region
Triplet frequencies in various functional parts
in poly-adenylation regions
LDA
Solovyev et al. (1994), NAR, 22(24), 5156-5163.
http://genomic.sanger.ac.uk/gf/gf.shtml

82
Prediction of coding potential

Periodicity detection
Coding sequences have an inherent periodicity of
three
Especially good on long coding sequences
Auto-correlation
Seeking the strongest response when shifted
sequence is compared with original
Michel (1986), J. Theor. Biol. 120, 223-236.
Fourier transformation: spectral analysis
Detection of a peak at the position corresponding
to a frequency of 1/3
Silverman and Linsker (1986), J. Theor. Biol.
118, 295-300.
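
The spectral approach can be illustrated by evaluating the discrete Fourier transform of each base's indicator sequence at a frequency of 1/3 and summing the squared magnitudes. This is a generic textbook-style sketch, not the exact measure from the papers cited above; the normalization by sequence length is an assumption chosen for readability.

```python
import cmath


def period3_power(dna):
    """Spectral power at frequency 1/3, summed over the four bases.

    For each base, the indicator sequence (1 where the base occurs,
    0 elsewhere) is transformed at the single frequency 1/3.
    """
    dna = dna.upper()
    n = len(dna)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, b in enumerate(dna)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n
```

In practice the function would be applied to sliding windows along the genomic sequence, with high values suggesting coding potential. Longer windows give a cleaner peak, which is consistent with the remark that the method works best on long coding sequences.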

83
Prediction of coding potential (cont.)

Trifonov (1980-1987)
G-notG-U periodicity
JMB, 194, 643-652.
Fickett (1982)
Position asymmetry in the three codon positions
NAR 10(17), 5303-5318.
Staden (1984)
Codon usage in tables
NAR 12, 551-567.

84
Prediction of coding potential (cont.)

Claverie and Bougueleret (1987)
Hexamer frequency differentials
NAR 14, 179-196.
Fichant and Gautier (1987)
Codon usage homogeneity
CABIOS, 3(4), 287-295.
GRAIL I (1991)
Neural network using a shifting fixed size window
7 sensors as input, 2 hidden layers and 1 unit as
output
Uberbacher et al. (1991), PNAS, 88(24),
11261-11265.
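
The hexamer frequency idea listed on this slide can be sketched as a log-odds score: hexamer frequencies are estimated separately from coding and non-coding training sequences, and a window is scored by summing the log ratio over its overlapping hexamers. This is one simple reading of a hexamer differential, not a reimplementation of the cited method; the function names and the pseudocount are assumptions.

```python
import math
from collections import Counter


def hexamer_freqs(sequences, pseudocount=1.0):
    """Return a function giving the smoothed frequency of any hexamer."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update(seq[i:i + 6] for i in range(len(seq) - 5))
    total = sum(counts.values())
    denominator = total + pseudocount * 4 ** 6   # 4096 possible hexamers
    return lambda hexamer: (counts[hexamer] + pseudocount) / denominator


def hexamer_score(window, coding, noncoding):
    """Sum of log(coding frequency / non-coding frequency) over a window."""
    window = window.upper()
    return sum(
        math.log(coding(window[i:i + 6]) / noncoding(window[i:i + 6]))
        for i in range(len(window) - 5)
    )
```

A positive score means the window's hexamers are more typical of the coding training set, a negative score the opposite.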

85
Prediction of coding potential (cont.)

GeneMark (1986)
Inhomogeneous Markov chain models
Easily trainable (closed solution for Maximum
Likelihood)
Used extensively in prokaryotic genomes
Borodovsky et al. (1993), Computers & Chemistry,
17, 123-133.
Glimmer (1998)
Interpolated Markov chains from first to eighth
order
Salzberg et al. (1998), NAR, 26(2), 544-548.
http://www.tigr.org/softlab/glimmer/glimmer.html

86
Prediction of coding potential (cont.)

Review by Fickett (1992)
Assessment of protein coding measures, NAR, 20,
6441-6450.

87
Prediction of coding exons

SorFind
Detection of spliceable ORFs
Hutchinson, NAR, 20(13), 3453-3462.
BCM GeneFinder
FEXD, FEXN, FEXA, FEXY, FEXH, HEXON
LDA
Solovyev et al. (1994), NAR, 22(24), 5156-5163.
http://genomic.sanger.ac.uk/gf/gf.shtml
GRAIL II
Exon candidates, heuristic integration, learning
with neural network
Uberbacher et al., Genet. Eng., 16, 241-253.
http://compbio.ornl.gov/

88
Integrated gene models: LDA/QDA

FGene
LDA based
Dynamic programming for the integration of LDA
output
Solovyev et al. (1995), ISMB, 3, 367-375.
http://genomic.sanger.ac.uk/gf/gf.shtml

89
Integrated gene models: NN

GeneParser
Gene-parsing approach
Potential alternative splicing recognized
Neural network and dynamic programming
Snyder and Stormo (1995), JMB, 248, 1-18.

90
Integrated gene models: Artificial intelligence
approaches

GeneID
Rule-based system
Homology integration
Guigó et al. (1992), JMB, 226, 141-157.
http://www1.imim.es/geneid.html
GeneID using DP
DP to combine a set of potential exons
Guigó et al. (1998), JCB, 5, 681-702.

91
Integrated gene models: Artificial intelligence
approaches

GenLang
Syntactic pattern recognition system
Formal grammar
Tools from computational linguistics
Dong and Searls (1994), Genomics, 23, 540-551.
http://cbil.humgen.upenn.edu/sdong/genlang_home.html

92
Integrated gene models: HMMs

HMMGene
Several genes per sequence possible
User constraints possible
Krogh (1997), ISMB, 5, 179-186.
http://www.cbs.dtu.dk/services/HMMgene/
GeneMark.hmm
Based on GeneMark program for bacterial sequences
Can predict frame shifts
Trained for various organisms
Lukashin and Borodovsky (1998), NAR, 26,
1107-1115.
http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html

93
Integrated gene models: GHMMs

Genie
Generalized hidden Markov model with length
distribution
Integration of multiple content and signal
sensors
Content: codon statistics, repeats, intron,
intergenic, database homology hits
Signal: promoter, start codon, splice sites, stop
codon
Dynamic programming to find optimal parse
Several genes per sequence possible
Kulp et al. (1996), ISMB, 4, 134-142.
Reese et al. (1997), JCB, 4(3), 311-323.
http://www.cse.ucsc.edu/dkulp/cgi-bin/genie

94
Example: Genie
95
Integrated gene models: GHMMs

GenScan
Multiple content and signal models
Semi-hidden Markov model sensors with length
distribution
Takes GC content into account (separate models)
Several genes per sequence possible
Burge and Karlin (1997), JMB, 268(1), 78-94.
http://CCR-081.mit.edu/GENSCAN.html

96
EST/cDNA alignment for gene finding: Spliced
alignments

PROCRUSTES
Spliced alignment algorithm
Dynamic programming to combine a set of potential
exons
Frame conservation
Homologous sequence needed
Gelfand et al. (1996), PNAS, 93, 9061-9066.
http://hto-13.usc.edu/software/procrustes/

97
EST/cDNA alignment

Sim4
Aligns cDNA to genomic sequence
Uses local similarity
Florea et al. (1998), Genome Research, 8,
967-974.
GeneWise
Dynamic programming
Partial genes allowed
Based on Pfam and statistical splice site models
Birney (1999), unpublished
http://www.sanger.ac.uk/Software/Wise2

98
EST/cDNA alignment (cont.)

ACEMBLY
Aligns ESTs to genomic sequence
Identifies alternative splicing
Integrated in ACeDB
Jean Thierry-Mieg (unpublished)

99
Repeat finders

Censor
Uses database of repeat sequences
Jurka et al. (1996), Comp. and Chem., 20(1),
119-122.
BLAST
Integrated masking operations
XBLAST procedure
Claverie (1994), In Automated DNA Sequencing and
Analysis Techniques, M. D. Adams, C. Fields and
J. C. Venter, eds., 267-279.
http://www.ncbi.nlm.nih.gov/BLAST

100
Repeat finders (cont.)

RepeatMasker
Detection of interspersed repeats
Smit and Green, unpublished results
http://ftp.genome.washington.edu/RM/RepeatMasker.html

101
Homology searching

BLAST suite
BLASTN, BLASTX, TBLASTX, PSI-BLAST
Altschul et al. (1990), JMB, 215, 403-410.
http://www.ncbi.nlm.nih.gov/BLAST
FASTA suite
FASTA, TFASTA
Pearson and Lipman (1988), PNAS, 85, 2444-2448.
HMM-based searching
SAM (UCSC group)
http://www.cse.ucsc.edu/research/compbio/sam.html
HMMER, Sean Eddy
http://hmmer.wustl.edu/

102
Gene family searching

BLOCKS
http://www.blocks.fhcrc.org
PROSITE
http://www.expasy.ch/prosite/
PFAM
http://pfam.wustl.edu/
SCOP
http://scop.mrc-lmb.cam.ac.uk/scop/

103
The genome annotation experiment (GASP1)

Genome Annotation Assessment Project (GASP1)
Annotation of 2.9 Mb of Drosophila melanogaster
genomic DNA
Open to everybody, announced on several mailing
lists
Participants can use any analysis methods they
like (gene finding programs, homology searches,
by-eye assessment, combination methods, etc.) and
should disclose their methods.
CASP-like
12 participating groups

104
URL: http://www.fruitfly.org/GASP1
105
Goals of the experiment

Compare and contrast various genome annotation
methods
Objective assessment of the state of the art in
gene finding and functional site prediction
Identify outstanding problems in computational
methods for the annotation process

106
Adh contig

2.9 Mb contiguous Drosophila sequence from the
Adh region, one of the best studied genomic
regions
From chromosome 2L (34D-36A)
Ashburner et al. (to appear in Genetics)
222 gene annotations (as of July 22, 1999)
375,585 bases are coding (12.95%)
We chose the Adh region because it was thought to
be typical: a representative test bed for
evaluating annotation techniques.
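
The summary figures on this slide can be cross-checked with a little arithmetic. The sketch below takes the contig length as exactly 2,900,000 bases, which is an assumption based on the rounded "2.9 Mb" figure; the constant and function names are illustrative.

```python
CONTIG_LENGTH = 2_900_000   # assumed from the rounded "2.9 Mb"
CODING_BASES = 375_585      # coding bases quoted on the slide
GENE_COUNT = 222            # gene annotations as of July 22, 1999


def percent_coding(coding_bases, total_bases):
    """Share of the sequence that is coding, in percent."""
    return 100.0 * coding_bases / total_bases


def bases_per_gene(total_bases, gene_count):
    """Average spacing between annotated genes, in bases."""
    return total_bases / gene_count
```

With these inputs the coding share comes out at about 12.95 percent, matching the slide, and the average density is roughly one annotated gene per 13 kb.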

107
Adh paper (to appear in Genetics)
URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
108
Raw sequence: Adh.fa

GAATTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGAC
GTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCT
TCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTG
CCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGC
CAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTG
GTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTA
CTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAA
CTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGG
CCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAATT
CCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATAC
TGAGCCCAAATGAGCGATAGATAGATAGATCGTGCGGCGATCTCGTACTG
GTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTC
TGGTTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTC
TCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTG
GGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAAC
GATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTG
CCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAG
CTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGG
ACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGT
TTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGC
CGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTAAAGTAACCTGCGGGAA
TTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAAT
ACTGAGCCCAAATGAGCGATAGATAGATAGATCGTGCGGCGATCTCGTAC
TGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTT
TCTGGTTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGT
TCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTG
TGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAA
ACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACT
TGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGC
AGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATG
GGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCT
GTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTT
GCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCC
CTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCG
GCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGT
ATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCAT
CGAAAAGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTC
TGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTGGG
CGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGA
TTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCC
CTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCT
GGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGAC
CTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTT
GACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGG
TTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGG
ACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTC
CTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGT
TGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACG
GCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTG
TGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCG
TACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACA
AACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGT
GGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAA
TTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAAT
ACTGTGCGGCGATCTCGTACTGGACGGAAATGTCAGGAGATAGGAGAAGA
AAA

109
Drosophila data sets provided to participants

Curated Drosophila nuclear DNA "coding sequences"
(CDS)
Curated non-redundant Drosophila genomic DNA data
(275 multi- and 144 single-exon sequence
entries from Genbank)
Drosophila 5' and 3' splice sites
Drosophila start codon sites
Drosophila promoter sequences
Drosophila repeat sequences
Drosophila transposon sequences
Drosophila cDNA sequences
Drosophila EST sequences

URL: http://www.fruitfly.org/GASP1/data/data.html
110
Timetable

May 13, 1999 - June 30, 1999
Distribution of the sample sequence and
associated data to the predictors. Collection of
predictions.
June 30, 1999 - July 31, 1999
Evaluation of the predictions by the Drosophila
Genome Center.
August 4, 1999
External expert assessment of the prediction
results (HUGO meeting, EMBL)
August 6, 1999
Tutorial 3 at the ISMB 99 conference in
Heidelberg, Germany

111
Resources for assessing predictions

80 cDNA sequences NOT in Genbank before
experiment deadline
Sequenced from 5 different cDNA libraries
3 paralogs to other genes in the genome
19 cDNAs with cloning artifacts
2 apparently representing unspliced RNA
Multiple inserts (2 cDNAs cloned in the same
vector)
58 usable cDNAs
33 cDNA sequences in Genbank during experiment
Annotations from Adh paper

112
Curated data sets for assessing predictions

Standard 1 (Adh.std1.gff): conservative gene set
43 gene structures (7 single- and 36 multi-
coding exon genes)
Criteria for inclusion
>95% (most >99%) of the cDNA aligned to genomic
DNA (using sim4)
GT/AG splice site consensus sequences
Splice site score from neural net
5' splice sites >0.35 threshold (98% True
Positive score)
3' splice sites >0.25 threshold (92% True
Positive score)
Start codon and stop codon annotations from
Standard 3 (derived from Adh paper)
These 43 genes represent typical genes
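
The inclusion criteria above can be read as a simple filter over candidate genes. The following is a minimal sketch in Python, assuming the slide's thresholds mean more than 95% of the cDNA aligned, a 5' splice site score above 0.35 and a 3' splice site score above 0.25. The `Intron` and `CandidateGene` types and the function `passes_standard1` are hypothetical names for illustration, not the code used by the Drosophila Genome Center.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Intron:
    donor: str             # first two intron bases, expected "GT"
    acceptor: str          # last two intron bases, expected "AG"
    donor_score: float     # 5' splice site score (e.g. from a neural net)
    acceptor_score: float  # 3' splice site score

@dataclass
class CandidateGene:
    name: str
    aligned_fraction: float               # fraction of the cDNA aligned to genomic DNA
    introns: List[Intron] = field(default_factory=list)

# Thresholds as read from the slide (assumed interpretation).
MIN_ALIGNED = 0.95
MIN_DONOR_SCORE = 0.35
MIN_ACCEPTOR_SCORE = 0.25

def passes_standard1(gene: CandidateGene) -> bool:
    """Return True if the gene meets all of the assumed Standard 1 criteria."""
    if gene.aligned_fraction <= MIN_ALIGNED:
        return False
    for intron in gene.introns:
        # Require the GT/AG splice site consensus.
        if intron.donor.upper() != "GT" or intron.acceptor.upper() != "AG":
            return False
        # Require both splice site scores to exceed their thresholds.
        if intron.donor_score <= MIN_DONOR_SCORE:
            return False
        if intron.acceptor_score <= MIN_ACCEPTOR_SCORE:
            return False
    return True

good = CandidateGene("g1", 0.99, [Intron("GT", "AG", 0.90, 0.80)])
poor_alignment = CandidateGene("g2", 0.90, [Intron("GT", "AG", 0.90, 0.80)])
non_consensus = CandidateGene("g3", 0.99, [Intron("GC", "AG", 0.90, 0.80)])
single_exon = CandidateGene("g4", 0.97)  # no introns, so only the alignment test applies
```

A single-exon gene has no introns, so in this sketch it is accepted on the alignment criterion alone.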

113
Curated data sets for assessing predictions

Standard 2 (Adh.std2.gff)
Superset of Standard 1
15 additional gene structures
Same alignment criteria as Standard 1 but no
splice site consensus requirement
Not used in the experiment

114
Curated data sets for assessment

Standard 3 (Adh.std3.gff): more complete gene
set
222 gene structures (39 single- and 183 multi-
coding exon genes)
Criteria
Annotated as described in Ashburner et al.
cDNA to genomic alignment using sim4
Start codons predicted by ORFFinder (Frise et
al., unpublished)
182 genes have similarity to a homologous
protein sequence in another organism or have a
Drosophila EST hit
Edge verification by partial EST/cDNA alignments
BLASTX, TBLASTX homology results
PFAM alignments
Gene structure verification using GenScan (human)
14 genes had EST/homology hits but no gene
finding predictions
40 genes only have strong GenScan predictions

115
Submission format

GFF (Durbin and Haussler, 1998, unpublished)
http://www.sanger.ac.uk/Software/GFF/

116
Sample submission
organism Drosophila melanogaster std1

Gene 1
Adh std1 TFBS            32002 32006 . + .
Adh std1 TATA_signal     32009 32012 . + . transcript "1"
Adh std1 TSS             32033 32034 . + . transcript "1"
Adh std1 prim_transcript 32034 33122 . + . transcript "1"
Adh std1 exon            32034 32277 . + . transcript "1"
Adh std1 start_codon     32122 32124 . + . transcript "1"
Adh std1 CDS             32122 32277 . + . transcript "1"
Adh std1 splice5         32277 32278 . + . transcript "1"
Adh std1 splice3         32332 32333 . + . transcript "1"
Adh std1 exon            32785 32830 . + . transcript "1"
Adh std1 CDS             32785 32830 . + . transcript "1"
Adh std1 splice5         32830 32831 . + . transcript "1"
Adh std1 splice3         32825 32826 . + . transcript "1"
Adh std1 CDS             32826 33003 . + . transcript "1"
Adh std1 exon            32826 33122 . + . transcript "1"
Adh std1 stop_codon      33001 33003 . + . transcript "1"
Adh std1 polyA_signal    33090 33095 . + . transcript "1"
Adh std1 polyA_site      33101 33102 . + . transcript "1"

Gene 2
Adh std1 prim_transcript 38100 41973 . - . transcript "2"
Adh std1 exon            38100 41973 . - . transcript "2"
Adh std1 polyA_site      39620 39621 . - . transcript "2"
Adh std1 polyA_signal    39685 39690 . - . transcript "2"
Adh std1 stop_codon      40125 40127 . - . transcript "2"
Adh std1 CDS             40125 40390 . - . transcript "2"
Adh std1 start_codon     40388 40390 . - . transcript "2"
Adh std1 TSS             41973 41974 . - . transcript "2"
Adh std1 TATA_signal     41998 42001 . - . transcript "2"
Adh std1 TFBS            42187 42193 . - .
Adh std1 TFBS            42211 42216 . - .
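
A submission in this format can be read with a few lines of code. The sketch below, in Python, assumes whitespace-separated fields in the order seqname, source, feature, start, end, score, strand, frame, followed by an optional group such as transcript "2". The `GffRecord` type and `parse_gff_line` function are hypothetical names for illustration and are not part of any GFF toolkit.

```python
from typing import NamedTuple, Optional

class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str
    frame: str
    group: Optional[str]    # e.g. 'transcript "2"', or None if absent

def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one record; return None for blank lines and '#' comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Split into at most 9 columns so the group keeps its internal spaces.
    fields = line.split(None, 8)
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}: {line!r}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) == 9 else None
    return GffRecord(seqname, source, feature, int(start), int(end),
                     None if score == "." else float(score),
                     strand, frame, group)

# Records taken from the second gene of the sample submission.
sample = [
    'Adh std1 stop_codon 40125 40127 . - . transcript "2"',
    'Adh std1 CDS 40125 40390 . - . transcript "2"',
    'Adh std1 TFBS 42211 42216 . - .',
]
records = [parse_gff_line(text) for text in sample]

# Coordinates are inclusive, so a feature's length is end - start + 1.
cds_length = sum(r.end - r.start + 1 for r in records if r.feature == "CDS")
```

Grouping records by the `group` field then recovers the features that belong to one transcript.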
117
Submissions

MAGPIE Team
Credit
Terry Gaasterland, Alexander Sczyrba, Elizabeth
Thomas, Gulriz Kurban, Paul Gordon, Christoph
Sensen
Laboratory for Computational Genomics,
Rockefeller University and Institute for Marine
Biosciences, Canada
Method
Automatic genome analysis system integrating
Drosophila Genscan predictions, confirming exon
boundaries using database searches, repeat
finding (Calypso, REPuter) and gene function
annotations.

118
Submissions (cont.)

References
Multigenome MAGPIE poster at ISMB 99.
Gaasterland and Ragan (1998), J. of Microbial and
Comparative Genomics, 3, 305-312.
Gaasterland and Sensen (1996), Biochimie 78,
302-310.
REPuter: Kurtz and Schleiermacher (1999),
Bioinformatics 15(5), 426-427.

119
Submissions (cont.)

Computational Genomics Group, The Sanger Centre
Credit
Victor Solovyev, Asaf Salamov
Method
Discriminant analysis based gene prediction
programs: FGenes (trained for human) and FGenesH
(trained for Drosophila). Combines the output of
FGenes, FGenesH and BLAST using FGenesH. Three
different threshold annotations were submitted.
The program's running time is linear in the
sequence length.
Automatic, plus additional user interactive
screening.
Non-redundant NCBI database used for BLAST.
URL/References
http://genomic.sanger.ac.uk/gf/gf.shtml

120
Submissions (cont.)

Genome Annotation Group, The Sanger Centre
Credit
Ewan Birney
Method
Protein family based gene identification using
Wise2 (previously Genewise) and PFAM.
URL
http://www.sanger.ac.uk/Software/Wise2

121
Submissions (cont.)

Pattern Recognition, The University of Erlangen
Credit
Uwe Ohler, Georg Stemmer, Stefan Harbeck,
Heinrich Niemann
Method
Promoter recognition based on interpolated Markov
chains: Genscan-like promoter model
(MCPromoter); maximal mutual information based
estimation of interpolated Markov chains.
Automatic.
Promoter training data set from
http://www.fruitfly.org/data/genesets

122
Submissions (cont.)

References
Ohler, Harbeck, Niemann, Noeth and Reese (1999),
Bioinformatics 15(5), 362-369.
Ohler, Harbeck and Niemann (1999), Proc.
EUROSPEECH, to appear.
URL
http://www5.informatik.uni-erlangen/HTML/English/Research/Promoter

123
Submissions (cont.)

Computational Biosciences, Oak Ridge National
Laboratory
Credit
Richard J. Mural, Douglas Hyatt, Frank Larimer,
Manesh Shah, Morey Parang
Method
Integrated neural network based system including
gene assembly using EST and homology information
(GRAILexp).
URL
http://compbio.ornl.gov/droso

124
Submissions (cont.)

Center for Biological Sequence Analysis,
Technical University of Denmark
Credit
Anders Krogh
Method
Modular HMM incorporating database hits (proteins
and ESTs/cDNAs) and other external information
probabilistically (HMMGene). The HMM has modules
for coding regions, splice sites, translation
start/stop, etc.
It will be a fully automated system.
Trained on Drosophila data
http://www.fruitfly.org/GSAC1/data/data.html
and
Victor Solovyev (personal communication)

125
Submissions (cont.)

References
Krogh (1998), In S.L. Salzberg et al., eds.,
Computational Methods in Molecular Biology,
45-63, Elsevier.
Krogh (1997), Gaasterland et al., eds., Proc.
ISMB 97, 179-186.
http://www.cbs.dtu.dk/krogh/refs.html
URL
http://www.cbs.dtu.dk/services/HMMgene/
Not yet for Drosophila.

126
Submissions (cont.)

BLOCKS group, Fred Hutchinson Cancer Research
Center in Seattle, Washington
Credit
Jorja Henikoff, Steve Henikoff
Method
DNA translation in 6 frames and search against
BLOCKS and against BLOCKS extracted from
Smart3.0 (http://coot-embl-heidelberg.de/SMART/)
using BLIMPS; automatic post-processing to join
multiple predictions from the same block.
Automatic with some user interactive screening of
results.
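
The six-frame translation step can be illustrated with a short sketch in Python. It assumes the standard genetic code and translates every frame from its first complete codon; the function names are hypothetical, and the subsequent BLOCKS search with BLIMPS is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna: str) -> str:
    """Translate from position 0; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translations(dna: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

frames = six_frame_translations("ATGGCCTAA")
# frames["+1"] is "MA*" and frames["-1"] is "LGH"
```

Each of the six protein strings would then be searched against the block database separately.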

127
Submissions (cont.)

References
Henikoff, Henikoff and Pietrokovski (1999), Nucl.
Acids Res., 27, 226-228.
Henikoff and Henikoff (1994), Proc. 27th Ann.
Hawaii Intl. Conf. on System Sciences, 265-274.
Henikoff and Henikoff (1994), Genomics, 19,
97-107.
URL
http://blocks.fhcrc.org
http://blocks.fhcrc.org/blocks-bin/getblock.sh?<block name>

128
Submissions (cont.)

Genome Informatics Team, IMIM, Barcelona, Spain
Credit
Roderic Guigó, Josep F. Abril, Enrique Blanco,
Moises Burset, Genis Parra
Method
Dynamic programming based system to combine
potential exon candidates modeled as a
fifth-order Markov model and functional sequence
sites modeled as a position weight matrix (Geneid
version 3).
Fully automatic, very fast.
Trained on Drosophila data
http://www.fruitfly.org/GSAC1/data/data.html
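
The idea of combining scored exon candidates by dynamic programming can be sketched as follows. This Python example is a deliberately simplified illustration, not Geneid's actual algorithm: it assumes each candidate already carries a single score and only requires that chained exons do not overlap, ignoring reading frame, intron length and site compatibility. The name `best_exon_chain` is hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

Exon = Tuple[int, int, float]  # (start, end, score), inclusive coordinates

def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring set of non-overlapping exons, in order."""
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds (score, chain) for the optimum over the first i exons.
    best: List[Tuple[float, List[Exon]]] = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this start.
        j = bisect_left(ends, start, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [exons[i]]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]

candidates = [
    (100, 200, 5.0),
    (150, 300, 6.0),  # overlaps the first and third candidates
    (250, 400, 4.0),
    (450, 500, 3.0),
]
total, chain = best_exon_chain(candidates)
# total is 12.0: the chain skips the overlapping (150, 300) candidate
```

Sorting by end coordinate lets each candidate look up the best compatible prefix with one binary search, so the scores are computed in O(n log n) time for n candidates.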

129
Submissions (cont.)

References
Guigó et al. (1998), JCB, 5, 681-702.
URL
Information on training process
http://www1.imim.es/rguigo/AnnotationExperiment/index.html
http://www1.imim.es/geneid.html

130
Submissions (cont.)

Mark Borodovsky's Lab, School of Biology, Georgia
Institute of Technology
Credit
Mark Borodovsky, John Besemer
Method
Markov chain models combined with HMM technology
(Genemark.hmm).
URL
http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html
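
As a generic illustration of the Markov chain models mentioned above, the following Python sketch estimates an order-k Markov chain from training sequences and scores a new sequence by its log-likelihood. The slide does not describe the internals of Genemark.hmm, so this is not that program's method; the add-one pseudocount and the names `train_markov` and `log_likelihood` are assumptions made for the example.

```python
import math
from collections import defaultdict

def train_markov(seqs, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})  # unseen contexts fall back to uniform
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order):
    """Sum of log transition probabilities over the sequence."""
    seq = seq.upper()
    return sum(
        math.log(prob(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )

# Toy first-order model trained on an alternating sequence.
model = train_markov(["ACACACACAC"], order=1)
alternating = log_likelihood("ACAC", model, order=1)
homopolymer = log_likelihood("AAAA", model, order=1)
```

A gene finder would typically train one such model on coding sequence and another on non-coding sequence, then compare the two log-likelihoods for a candidate region.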

131
Submissions (cont.)

Biodivision, GSF Forschungszentrum für Umwelt und
Gesundheit, Neuherberg, Germany
Credit
Matthias Scherf, Andreas Klingenhoff, Thomas
Werner
Method
Universal sequence classifier which is based on a
correlated word analysis to predict initiators
and promoter associated TATA boxes (CoreInspector
V1.0 beta). Sequences of 100 bp are classified at
once.
Trained on Eukaryotic Promoter Database (EPD
version 5.9).
Fully automatic, 2 seconds per 1 kb.
References
Scherf et al. (1999), in prepa