The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster

Reese et al., Tutorial #3, ISMB 99.



1
The challenge of annotating a complete eukaryotic
genome: A case study in Drosophila melanogaster
  • Martin G. Reese (mgreese_at_lbl.gov)
  • Nomi L. Harris (nlharris_at_lbl.gov)
  • George Hartzell (hartzell_at_cs.berkeley.edu)
  • Suzanna E. Lewis (suzi_at_fruitfly.berkeley.edu)
  • Drosophila Genome Center, Department of Molecular
    and Cell Biology, 539 Life Sciences
    Addition, University of California, Berkeley

2
Abstract
Many of the technical issues involved in
sequencing complete genomes are essentially
solved. Technologies already exist that provide
sufficient solutions for ascertaining sequencing
error rates and for assembling sequence data.
Currently, however, standards or rules for the
annotation process are still an outstanding
problem. How shall the genomes be annotated,
what shall be annotated, which computational
tools are most effective, how reliable are these
annotations, how organism-specific do the tools
have to be and ultimately how should the
computational results be presented to the
community? All these questions are unsolved. This
tutorial will give an overview and assessment of
the current state of annotation based upon
experiences gained at the Drosophila melanogaster
genome project. In the tutorial we will do three
things. First, we will break down the annotation
process and discuss the various aspects of the
problem. This will serve to clarify the term
"annotation", which is often used to collectively
describe a process that has a number of discrete
steps. Second, with the participation of
computational biologists from the community we
will compare existing tools for sequence
annotation. We will do this by providing a 3
megabase sequence that has already been
well-characterized at our center as a testbed for
evaluating other feature-finding algorithms. This
is similar to what has been done at the CASP
(critical assessment of techniques for protein
structure prediction) conferences
(http://predictioncenter.llnl.gov) for protein
structure prediction. Third, we will discuss
which annotation problems are essentially solved
and which problems remain.

3
Tutorial goals
  • Review the algorithms currently used in
    annotation
  • Assess existing methods under field conditions
  • Identify open issues in annotation

4
Tutorial organization
  • Definitions
  • Annotation
  • Biological issues
  • Engineering issues
  • Application of tools within an existing
    annotation system
  • Break (20 minutes)
  • Review of existing tools
  • Our annotation experiment
  • Conclusions and outstanding issues

5
What is a gene?
  • Definition: An inheritable trait associated with
    a region of DNA that codes for a polypeptide
    chain or specifies an RNA molecule, which in turn
    has an influence on some characteristic
    phenotype of the organism.

6
What are annotations?
  • Definition: Features on the genome derived
    through the transformation of raw genomic
    sequences into information by integrating
    computational tools, auxiliary biological data,
    and biological knowledge.

7
How does an annotation differ from a gene?
  • Many annotations are the same as genes
  • The annotation describes an inheritable trait
    associated with a region of DNA.
  • But an annotation may not always correspond in
    this way, e.g. an STS or a sequence overlap
  • The annotated region of genomic DNA or RNA may not
    be translated or transcribed

8
Transcription and translation
9
Schematic gene structure
10
Sequence feature types
  • Transcribed region
  • mRNA, tRNA, snoRNA, snRNA, rRNA
  • Structural region
  • Exon, intron, 5' UTR, 3' UTR, ORF, cleavage
    product
  • Mutations: insertion, deletion, substitution,
    inversion, translocation
  • Functional or signal region
  • Promoter, enhancer, DNA/RNA binding site, splice
    site signal, poly-adenylation signal
  • Protein processing: glycosylation, methylation,
    phosphorylation site
  • Similarity
  • Homolog, paralog, genomic overlap (syntenic
    region)
  • Other feature types
  • Transposable element, repetitive element
  • Pseudogene
  • STS, insertion site

11
DNA transcription unit features
  • Promoter elements
  • Core promoter elements
  • TATA box
  • Initiator (Inr)
  • Downstream promoter element (DPE)
  • Transcription factor (TF) binding sites
  • CAAT boxes
  • GC boxes
  • SP-1 sites
  • GAGA boxes
  • Enhancer site(s)

12
mRNA features
  • Exon
  • Initial, internal, terminal
  • Codon usage, preference
  • Control elements (e.g. splice enhancers)
  • Intron
  • 5' splice site (GT), branchpoint (lariat), 3'
    splice site (AG)
  • Repeat elements
  • Start codon (translation start site)
  • Kozak rule
  • UTR (untranslated regions)
  • 5' UTR
  • Translation regulatory elements
  • RNA binding sites
  • Initial, internal, terminal
  • Control elements (e.g. splice enhancers)
  • 3' UTR
  • RNA binding sites (cis-acting elements)
  • Stop codon
  • Poly-adenylation signal and site

13
(No Transcript)
14
Definitions for data modeling
  • Feature: An interval or an ordered set of
    intervals on a sequence that describes some
    biological attribute and is justified by
    evidence.
  • Sequence: A linear molecule of DNA, RNA or amino
    acids.
  • Evidence: A computational or experimental result
    coming out of an analysis of a sequence.
  • Annotation: A set of features.
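
The four definitions above map naturally onto a small data model. The following is a minimal sketch, not the BDGP schema: the class and field names, the half-open 0-based coordinate convention, and the `score` field are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Evidence:
    """A computational or experimental result from analysing a sequence."""
    source: str    # hypothetical label, e.g. the analysis program's name
    score: float   # hypothetical: a single numeric score per result


@dataclass
class Feature:
    """An interval, or ordered set of intervals, justified by evidence."""
    kind: str                         # e.g. "exon", "promoter"
    intervals: List[Tuple[int, int]]  # assumed 0-based, half-open (start, end)
    evidence: List[Evidence] = field(default_factory=list)

    def __post_init__(self):
        # Keep intervals sorted so the feature is an *ordered* set.
        self.intervals = sorted(self.intervals)

    def span(self) -> Tuple[int, int]:
        """Outer extent of the feature on the sequence."""
        return self.intervals[0][0], self.intervals[-1][1]

    def length(self) -> int:
        """Total number of bases covered by the intervals."""
        return sum(end - start for start, end in self.intervals)


@dataclass
class Annotation:
    """A set of features on one sequence."""
    sequence_id: str
    features: List[Feature] = field(default_factory=list)
```

A two-exon transcript would then be one `Feature` with two intervals, which keeps "feature spans a gap" questions (see the engineering issues later) explicit in the data rather than implicit in a single start/end pair.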

15
Annotation
Annotated genome
Depth of knowledge
Breadth of knowledge
16
Annotation process overview
Methods
Data
Genome Sequence
Auxiliary Data
Computational Tools
Database Resources
Annotation Systems
Understanding of a Genome
17
Types of sequence data
  • Chromosomal sequence
  • Euchromatic
  • Heterochromatic
  • mRNA sequences
  • Full length cDNA
  • 5' EST
  • 3' EST
  • Protein sequences
  • Insertion site flanking sequences

18
Auxiliary data
  • Maps
  • Genetic, physical, radiation hybrid map (RH),
    deletion, cytogenetic
  • Expression data
  • Tissue, stage
  • Phenotypes
  • Lethality, sterility

19
Computational annotation tools
  • Gene finding
  • Repeat finding
  • EST/cDNA alignment
  • Homology searching
  • BLAST, FASTA, HMM-based methods, etc.
  • Protein family searching
  • PFAM, Prosite, etc.

20
Database resources
  • Curated sequence feature data sets
  • Repeat elements
  • Transposons
  • Non-redundant mRNA
  • STSs and other sequence markers
  • Genome sequence from related species
  • D. melanogaster vs. D. virilis, D. hydei
  • Genome sequence from more distant species
  • Protein sequences from distant species

21
Biological issues in annotation
  • Common
  • Genes within genes
  • Alternative splicing
  • Alternative poly-adenylation sites
  • Rare
  • Translational frame shifting
  • mRNA editing
  • Eukaryotic operons
  • Alternative initiation

22
Engineering issues in annotation
  • What sequence to start with?
  • Because features are intervals on a sequence,
    problems can be caused by gaps, frameshifts, and
    other changes to the sequence. How do you track
    these changes over time and model features that
    span gaps?
  • When to annotate?
  • Feature identification can aid in sequencing. It
    may be advisable to carry out sequencing and
    annotation in parallel thus enabling them to
    complement one another.
  • What analyses need to be run and how?
  • What dependencies are there between various
    analysis programs?
  • What parameter settings to use?

23
Engineering issues in annotation
  • What public sequence data sets are needed?
  • What are the mechanics of obtaining public
    sequence databases?
  • Are curated data sets available or do you need to
    set up a means of maintaining your own (for
    repeats, insertions, organism of interest)?
  • How do you achieve computational throughput?
  • Workstation farm, or simply a big, powerful box?
  • Job flow control
  • What do you do with the results?
  • Homogenize results into single format?
  • Filter results for significance and redundancy

24
Engineering issues in annotation
  • Interpreting the results
  • Is human curation needed?
  • How can you achieve consistency between curators?
  • How do you design the user interface so that it
    is simple enough to get the task completed
    speedily but complex enough to deal with biology?
  • How do you capture curations?
  • How are annotation translations to be described?
  • EC terminology
  • ProSite families
  • Pfam domains
  • Is function distinguishable from process?

25
Engineering issues in annotation
  • How do you manage data?
  • What is the appropriate database schema design?
  • How is the database to be kept up to date? Will
    it be directly from programs running user
    interfaces and analyses or via a middleware
    layer?
  • Is a flat file format needed and what should it
    be?
  • What query and retrieval support is needed?
  • How do you distribute data?
  • For bulk downloads what is the format of the
    data?
  • What information is best summarized in tables?
  • What information requires an integrated graphical
    view?

26
Engineering issues in annotation
  • How do you update the annotations?
  • How frequently are they re-evaluated?
  • How can re-evaluation be minimized (only subsets
    of the databanks, only modified sequences)?
  • How can differences between old and new
    computational results be detected?
  • Changes in computational results may need to
    trigger changes in curated annotations

27
Drosophila melanogaster
  • Drosophila is one of the most important model organisms
  • Drosophila genome
  • 4 chromosomes
  • 180 Mb total sequence
  • 140 Mb euchromatic sequence
  • 12-14,000 genes

Source: G.M. Rubin
28
Drosophila Genome Project
  • Laboratories working on Drosophila sequencing
  • BDGP (Berkeley Drosophila Genome Project)
  • EDGP (European Drosophila Genome Project)
  • Celera Genomics Inc.
  • Complete D. melanogaster sequence will be
    finished by the end of 1999
  • Comprehensive database - FlyBase

29
Goals of the Drosophila Genome Project
  • Complete genome sequence
  • Structure of all transcripts
  • Expression pattern of all genes
  • Phenotype resulting from mutation of all ORFs
  • And more...

30
Sequencing at the BDGP
  • Genomic sequence
  • P1 and BAC clones
  • 24 Mb of completed sequence (as of July 22, 1999)
  • 18 Mb unfinished sequence in process
  • Complete tiling path in BACs
  • 1.5x-path draft sequencing
  • ESTs and cDNAs
  • 80,942 ESTs finished (as of March 19, 1999)
  • Over 800 full-length cDNAs

31
The BDGP sequence annotation process
32
What sequence to start with?
  • Unit of sequencing at the BDGP
  • Completed high-quality clone sequences
  • Reassembling the genomic sequence
  • Need to place clones in correct genomic positions
  • Need to integrate genes that span multiple clones
  • Solved by using genomic overlaps to reconstitute
    full genomic sequence

33
Which analyses need to be run?
  • Similarity searches
  • BLAST (Altschul et al., 1990)
  • BLASTN (nucleotide databases)
  • BLASTX (amino acid databases)
  • TBLASTX (amino acid databases, six-frame
    translation)
  • sim4 (Miller et al., 1998)
  • Sequence alignment program for finding
    near-perfect matches between nucleotide sequences
    containing introns
  • Gene predictors
  • Genefinder (Green, unpublished)
  • GenScan (Burge and Karlin, 1997)
  • Genie (Reese et al., 1997)
  • Other analyses
  • tRNAscan-SE (Lowe and Eddy, 1996)

34
Which analyses need to be run and how?
  • mRNAs
  • ORFFinder (Frise, unpublished)
  • Protein translations
  • HMMPFAM 2.1 (Eddy 1998) against PFAM (v 2.1.1
    Sonnhammer et al. 1997, Bateman et al. 1999)
  • Ppsearch (Fuchs 1994) against ProSite (release
    15.0) filtered with EMOTIF (Nevill-Manning et
    al. 1998)
  • Psort II (Horton and Nakai 1997)
  • ClustalW (Higgins et al. 1996)

35
What public sequence data sets are needed?
  • Automating updates of public databases
  • Genbank, SwissProt, trEMBL, BLOCKS, dbEST, EDGP
  • Curated data sets
  • D. melanogaster genes (FlyBase)
  • Transposable elements (EDGP)
  • Repeat elements (EDGP)
  • STSs (BDGP)

36
Which analyses need to be run and how?
37
How do you achieve computational throughput?
  • BDGP computing power
  • Sun Ultra 450 (3 machines, 4 processors each)
  • Sun Enterprise (1 machine, 8 processors)
  • Used these directly, without any system for
    distributed computing.
  • Job flow control: the Genomic Daemon
  • Automatic batch analysis of genomic clones
  • Berkeley Fly Database is used as the queuing system
    and for storage of results
  • Many clones can be analyzed simultaneously
  • Results are processed and saved in XML format for
    interactive browsing

38
What do you do with the results?
  • Berkeley Output Parser (BOP)
  • Input to BOP
  • Genomic sequence
  • Results of computational analyses
  • Filtering preferences
  • Parses results from BLAST, sim4, GeneFinder,
    GenScan, and tRNAscan-SE analyses
  • Filters BLAST and sim4 results
  • Eliminates redundant or insignificant hits
  • Merges hits that represent single region of
    homology
  • Homogenizes results into single format
  • Output sequence and filtered results in XML
    format
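
The filtering and merging steps listed above can be sketched as follows. This is an illustration only, not BOP's code: the slide does not give BOP's significance cutoffs or merge rules, so the `(start, end, score)` hit layout, the `min_score` threshold, and the `max_gap` parameter are assumptions.

```python
def filter_and_merge(hits, min_score, max_gap=0):
    """Drop low-scoring hits, then merge hits that overlap or lie within
    max_gap bases of each other into a single region.

    hits: iterable of (start, end, score) tuples (assumed layout).
    Returns a list of merged (start, end, best_score) tuples.
    """
    # Step 1: eliminate insignificant hits (threshold is hypothetical).
    kept = sorted(h for h in hits if h[2] >= min_score)

    # Step 2: sweep left to right, extending the current region while
    # the next hit starts inside (or within max_gap of) that region.
    merged = []
    for start, end, score in kept:
        if merged and start <= merged[-1][1] + max_gap:
            prev_start, prev_end, prev_score = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    return merged


example_hits = [(10, 50, 80), (40, 90, 120), (200, 260, 30), (300, 350, 95)]
regions = filter_and_merge(example_hits, min_score=50)
```

Sorting first makes the merge a single linear pass, which matters when a clone accumulates thousands of BLAST hits.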

39
Is human curation needed?
  • Not for everything
  • Some features are obvious and can be identified
    computationally
  • Known D. melanogaster genes are detected
    automatically by GeneSkimmer
  • Repetitive elements
  • But still for many things
  • Annotating complete gene structure is still hard
  • We use CloneCurator (BDGP's Java graphical
    editor) for curation

40
Gene Skimmer
  • Quick way of identifying genes in new sequence
    before curation
  • Start with XML output from BOP
  • Look for sim4 hits with known Drosophila genes
  • Find gene hits with sequence identity >98%,
    coverage >30%
  • Verify that hits represent real genes
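
The identity and coverage cutoffs above amount to a simple filter over alignment hits. A minimal sketch, assuming each sim4 hit has already been parsed from the BOP XML into a dictionary with `gene`, `identity` and `coverage` keys (these key names and the percent units are assumptions; the real Gene Skimmer input format is not shown on the slide):

```python
def skim_genes(hits, min_identity=98.0, min_coverage=30.0):
    """Return the best qualifying hit per known gene.

    hits: iterable of dicts with 'gene', 'identity' (percent) and
    'coverage' (percent) keys -- a hypothetical parsed form of sim4 output.
    """
    genes = {}
    for hit in hits:
        # Both cutoffs are strict, matching ">98%" and ">30%".
        if hit["identity"] > min_identity and hit["coverage"] > min_coverage:
            best = genes.get(hit["gene"])
            if best is None or hit["identity"] > best["identity"]:
                genes[hit["gene"]] = hit
    return genes


example = [
    {"gene": "Adh",   "identity": 99.5, "coverage": 80.0},
    {"gene": "Adh",   "identity": 98.5, "coverage": 40.0},
    {"gene": "geneB", "identity": 97.0, "coverage": 90.0},  # identity too low
    {"gene": "geneC", "identity": 99.0, "coverage": 20.0},  # coverage too low
]
skimmed = skim_genes(example)
```

The final "verify that hits represent real genes" step remains a human judgment and is not modelled here.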

41
Gene Skimmer
URL: http://www.fruitfly.org/sequence/genomic-clones.html
42
CloneCurator
  • Displays computational results and annotations on
    a genomic clone
  • Interactive browsing
  • Zoom/scroll
  • Change cutoffs for display of results
  • Analyze GC content, restriction sites, etc.
  • Interactive annotation editing
  • Expert endorses selected results
  • Presents annotations to community via Web site

43
(No Transcript)
44
How do we annotate gene/protein function?
  • Gene Ontology Project
  • Controlled hierarchical vocabulary for
    multiple-genome annotations and comparisons
  • Standardized vocabulary facilitates collaboration
  • Good data modeling allows better database
    querying
  • Ontology browser provides interactive search of
    hierarchical terms
  • GO project (http://www.ebi.ac.uk/ashburn/GO)

45
Ontology browser
46
(No Transcript)
47
Ontology browser: searching for terms
48
How do you distribute the data?
  • Bulk downloads
  • FASTA at http://www.fruitfly.org/sequence/download.html
  • Curated data sets
  • Tabular data
  • At http://www.fruitfly.org/sequence/
  • Sequenced genomic clones
  • Clone contigs sorted by genomic location
  • Clone contigs sorted by size
  • Ribbon provides integrated graphical view of
    annotations on physical contigs

49
Ribbon
  • Human curator annotates individual clones
    (100 kb)
  • Clones are assembled into physical contigs
    (regions of physical map)
  • Clone annotations are merged and renumbered for
    display on whole physical contigs
  • Ribbon is our Java display tool for displaying
    curated annotations on physical contigs
  • Will soon be available on Web
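
The "merged and renumbered" step above is essentially a coordinate shift from clone space to contig space. A rough sketch under simplifying assumptions: every clone lies in forward orientation on the contig, its offset on the contig is known, and a feature annotated twice in a clone overlap yields identical contig coordinates. None of these details come from the slide, and the function and argument names are hypothetical.

```python
def to_contig_coords(clone_features, clone_offsets):
    """Shift per-clone features onto contig coordinates and merge duplicates.

    clone_features: {clone_id: [(start, end, name), ...]} in clone coordinates.
    clone_offsets:  {clone_id: position of the clone's first base on the contig}.
    Returns a sorted list of unique (start, end, name) tuples.
    """
    merged = set()
    for clone_id, features in clone_features.items():
        offset = clone_offsets[clone_id]
        for start, end, name in features:
            # A feature seen in two overlapping clones collapses to one entry.
            merged.add((start + offset, end + offset, name))
    return sorted(merged)


features = {
    "cloneA": [(100, 200, "g1"), (90000, 95000, "g2")],
    "cloneB": [(10000, 15000, "g2"), (30000, 31000, "g3")],
}
offsets = {"cloneA": 0, "cloneB": 80000}
contig_view = to_contig_coords(features, offsets)
```

Reverse-oriented clones and genes that span clone boundaries would need extra handling that this sketch leaves out.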

50
Ribbon
51
How do you manage the data?
  • Using Informix as our database server
  • Updated via Perl DBI.pm module
  • Development underway in
  • Schema revisions
  • GAME DTD (Genome Annotation Markup Entities)
  • Perl module for annotation objects
  • http://www.bioxml.org/ (Ewan Birney)

52
How do you maintain annotations?
  • Open questions
  • How frequently are annotations re-evaluated?
  • How can re-evaluation be minimized (only subsets
    of the databanks, only modified sequences)?
  • How can differences between old and new
    computational results be detected?
  • Changes in computational results may need to
    trigger changes in curated annotations

53
Integrated annotation systems
  • ACeDB
  • Genotator
  • Magpie
  • GAIA
  • TIGR

54
Integrated annotation systems: ACeDB
  • Developed for analysis of the C. elegans genome
  • Sophisticated database designed for storing
    annotations and related information
  • New Java and Web-based versions available
  • Written by Jean Thierry-Mieg and Richard Durbin
  • http://www.sanger.ac.uk/Software/Acedb/

55
ACeDB
56
Genotator
  • Back end automates sequence analysis; browser
    provides interactive viewing and editing of
    annotations
  • Nomi Harris (1997), Genome Research 7(7),
    754-762.
  • http://www-hgc.lbl.gov/inf/annotation.html

57
Magpie
  • Expert system based (PROLOG)
  • Data collection daemon
  • Data analysis and report daemon
  • Intelligent integration of various individual
    feature prediction systems
  • Allows human interactions
  • Gaasterland and Sensen (1996), TIG, 12, 76-78.
  • http://genomes.rockefeller.edu/magpie/magpie.html

58
GAIA
  • Web-based system
  • Results displayed as Java applets
  • Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J.
    Crabtree, D.B. Searls, and G.C. Overton (1998),
    Genome Research.
  • http://daphne.humgen.upenn.edu:1024/gaia/

59
TIGR Human Gene Index
  • Gene Indices for various organisms
  • Databases for transcribed genes linked into
    external/internal genomic databases
  • Internal backend analysis software
  • http://www.tigr.org/tdb/tdb.html

60
Computational analysis tools
  • Gene finding
  • Repeat finding
  • EST/cDNA alignment
  • Homology searching
  • BLAST, FASTA, HMM-based methods, etc.
  • Protein family searching
  • PFAM, Prosite, etc.

61
Gene finding: Prokaryotes vs. Eukaryotes
  • Prokaryotes
  • Contiguous open reading frames (ORF)
  • Short intergenic sequences
  • Good method: detecting large ORFs
  • Complications
  • Partial sequences
  • Sequencing errors
  • Start codon prediction
  • Overlapping genes on both strands
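
Detecting large ORFs, the method named above for prokaryotes, can be sketched in a few lines. This is a generic illustration rather than any published tool: it assumes ATG as the only start codon, the three standard stop codons, and an arbitrary `min_codons` length cutoff. Coordinates on the "-" strand are reported relative to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns (strand, start, end) tuples, 0-based half-open on the scanned
    strand, for ORFs of at least min_codons codons (stop codon included).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
toy_orfs = find_orfs(toy, min_codons=5)
```

The complications listed above show up directly in this sketch: a sequencing error that introduces a frameshift or a false stop splits an ORF, a partial sequence may lack the ATG, and taking the first ATG is only a guess at the true start codon.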

62
Gene finding: Prokaryotes vs. Eukaryotes
  • Eukaryotes
  • Complex gene structures (exon/introns)
  • D. melanogaster has an average of 4 introns/gene
  • Very long genes (D. melanogaster X gene 160 kb)
  • Very long introns
  • Many introns
  • Nested, overlapping, and alternatively spliced
    genes
  • 5' UTRs with non-coding exons
  • Long 3' UTRs
  • Complex transcription machinery
  • ORF-finding alone is not adequate

63
Integrated gene finding
  • Assumptions
  • Signal and content sensors alone are not
    sufficient for predicting gene structure
  • Gene structure is hierarchical
  • Each component (exon, intron, splice site, etc.)
    can be modeled independently
  • The approach
  • Generate a list of candidates for each component
    (with scores)
  • Assemble the components into a gene model

64
Integrated gene finding: Dynamic programming
  • Determines the best combination of components
  • Two-part problem
  • Develop an optimal scoring function
  • Use dynamic programming to find an optimal
    alignment through scoring matrix
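
A stripped-down version of the dynamic programming step: given scored exon candidates, choose the non-overlapping set with the highest total score. This is a sketch of the general idea only. Real gene finders also enforce reading-frame and splice-site compatibility and score introns, none of which is modelled here; the `(start, end, score)` layout and the additive scoring function are assumptions.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Pick the highest-scoring set of non-overlapping exon candidates.

    candidates: list of (start, end, score), half-open coordinates.
    Returns (total_score, chain) with the chain in genomic order.
    """
    exons = sorted(candidates, key=lambda e: e[1])   # order by end position
    ends = [e[1] for e in exons]
    # best[i] holds the optimum over the first i exons in that order.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = how many of the earlier exons end at or before this start,
        # i.e. the prefix of exons compatible with the current one.
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


candidates = [(0, 100, 5.0), (50, 150, 8.0), (150, 300, 4.0), (120, 200, 3.0)]
total, chain = best_exon_chain(candidates)
```

Each candidate is considered once and the compatible prefix is found by binary search, so the assembly runs in O(n log n) time even though the number of possible exon combinations grows exponentially.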

65
Integrated gene finding: Dynamic programming
66
Integrated gene finding: Linear and Quadratic
Discriminant Analysis (LDA/QDA)
  • LDA
  • Deterministic calculation of thresholds
  • n-class discrimination
  • Example
  • HSPL, Solovyev et al. (1997), ISMB, 5, 294-302.
  • QDA
  • Can represent a great improvement over LDA
  • Example
  • MZEF, Michael Zhang (1997), PNAS, 94, 565-568.

67
Integrated gene finding: Feed-forward neural
networks
  • Supervised learning
  • Training to discriminate between several feature
    classes
  • Computing units
  • Gradient descent optimization
  • Multi-layer networks
  • Limitations
  • Black-box predictions
  • Local minima
  • Example
  • GRAIL, Uberbacher et al. (1991), PNAS, 88,
    11261-11265.

68
Approaches to gene finding: Hidden Markov models
  • Model
  • A finite model describing a probability
    distribution over all possible sequences of equal
    length
  • Natural scoring function
  • (Conditional) Maximum likelihood training
  • Markov
  • k-order Markov chain: the current state depends on
    the k previous states
  • The next state in a 1st-order Markov model
    depends on current state
  • Hidden
  • Hidden states generate visible symbols
  • Assumptions
  • Independence of states
  • No long range correlation
  • Example: HMMgene, A. Krogh (1998), In Guide to
    Human Genome Computing, 261-274.
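
To make "hidden states generate visible symbols" concrete, here is a minimal Viterbi decoder for a first-order HMM, which recovers the most probable state path for an observed sequence. It is a textbook sketch, not HMMgene: the two states and all probabilities in the example are invented toy values.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for the observations (log space)."""
    log = math.log
    # score[s]: best log-probability of any path that ends in state s.
    score = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of s, using first-order dependence only.
            prev = max(states, key=lambda p: score[p] + log(trans_p[p][s]))
            new_score[s] = score[prev] + log(trans_p[prev][s]) + log(emit_p[s][symbol])
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model (all numbers hypothetical): a GC-rich and an AT-rich state.
states = ("GC", "AT")
start_p = {"GC": 0.5, "AT": 0.5}
trans_p = {"GC": {"GC": 0.9, "AT": 0.1}, "AT": {"GC": 0.1, "AT": 0.9}}
emit_p = {
    "GC": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
    "AT": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
}
path = viterbi("GCGCGATATA", states, start_p, trans_p, emit_p)
```

The log-probability that Viterbi maximises is the "natural scoring function" mentioned above; a gene-finding HMM replaces the two toy states with exon, intron and intergenic states.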

69
Approaches to gene finding: Generalized hidden
Markov models
  • Each HMM state can be a probabilistic sub-model
  • Complex hierarchical system
  • Requires care in modeling state overlaps
  • Example
  • Genie, Kulp et al. (1996), ISMB, 4, 134-142
  • GenScan, Burge and Karlin (1997), JMB, 268(1),
    78-94

70
Gene finding software
  • Signal recognition
  • Promoter prediction
  • Splice site prediction
  • Start codon prediction
  • Poly-adenylation site prediction
  • Coding potential
  • Coding exons
  • Gene structure prediction
  • Spliced alignment
  • LDA/QDA
  • Neural networks
  • HMMs and GHMMs

71
Promoter recognition
  • PromoterScan
  • Identify potential promoter regions
  • Based on databases of known TF binding sites
  • TFD (Ghosh (1991), TIBS, 16, 445-447)
  • TRANSFAC (Heinemeyer et al. (1999), NAR, 27,
    318-322)
  • Prestridge (1995), JMB, 249, 923-932
  • http://bimas.dcrt.nih.gov/molbio/proscan/
  • MatInd and MatInspector
  • Finding consensus matches to known TF binding
    sites
  • Based on TRANSFAC
  • Heinemeyer et al. (1999), NAR, 27, 318-322
  • Quandt et al. (1995), NAR, 23, 4878-4884.
  • http://transfac.gbf.de/TRANSFAC/

72
Promoter recognition (cont.)
  • TSSG/TSSW
  • LDA based combination of several features
    (TATA-box, Inr signal, upstream regions)
  • Solovyev et al. (1997), ISMB, 5, 294-302.
  • http://genomic.sanger.ac.uk/gf/gf.shtml
  • Transcription Element Search Software
  • Identify TF binding sites
  • Based on TRANSFAC
  • http://agave.humgen.upenn.edu/tess/index.html

73
Promoter recognition (cont.)
  • CBS Promoter 2.0 Prediction Server
  • Simulated transcription factors
  • Principles common to neural networks and genetic
    algorithms
  • Knudsen (1999), Bioinformatics 15(5), 356-361.
  • http://genome.cbs.dtu.dk/services/promoter/
  • CorePromoter
  • Position dependent 5-tuple
  • QDA
  • Michael Zhang (1998), Genome Research, 8,
    319-326.
  • http://scislio.cshl.org/genefinder/CPROMOTER/

74
Promoter recognition (cont.)
  • Neural network promoter prediction (NNPP)
  • Time-delay neural network
  • Combining TATA box and initiator
  • Reese (1999), in preparation.
  • http://www-hgc.lbl.gov/projects/promoter.html

75
Example: NNPP
76
Promoter recognition (cont.)
  • Markov chain promoter finder
  • Competing interpolated Markov chains for
    promoters, exons, introns
  • Promoter model consists of five states
    representing the core promoter parts
  • Ohler, Reese et al. (1999), Bioinformatics 15(5),
    362-369.

77
Splice site prediction
  • Nakata, 1985
  • Nakata (1985), NAR, 13(14), 5327-5340.
  • BCM GeneFinder
  • HSPL - Prediction of splice sites in human DNA
    sequences
  • Triplet frequencies in various functional parts
    of splice site regions
  • Combined with codon statistics
  • Solovyev et al. (1994), NAR, 22(24), 5156-5163.
  • http://genomic.sanger.ac.uk/gf/gf.shtml

78
Splice site prediction (cont.)
  • Neural Network splice site predictor (NNSPLICE)
  • Multi-layered feed-forward neural network
  • Modeled after Brunak et al. (1991), JMB, 220,
    49-65.
  • Reese et al. (1997), JCB, 4(3), 311-323.
  • http://www-hgc.lbl.gov/projects/splice.html
  • NetGene2
  • Combination of neural networks and rule-based
    system
  • Splice site signal neural network combined with
    coding potential
  • Hebsgaard et al. (1996), NAR, 24(17), 3439-3452.
  • Brunak et al. (1991), JMB, 220, 49-65.
  • http://www.cbs.dtu.dk/services/NetGene2/

79
Splice site prediction (cont.)
  • SplicePredictor
  • Logitlinear models for splice site regions
  • Degree of matching to the splice site consensus
  • Local compositional contrast
  • Brendel and Kleffe (1998), NAR, 26(20),
    4748-4757.
  • http://gnomic.stanford.edu/volker/SplicePredictor.html

80
Start codon prediction
  • NetStart
  • Trained on cDNA-like sequences
  • Neural network based
  • Local start codon information
  • Global sequence information
  • Pedersen and Nielsen (1997), ISMB, 5, 226-233.
  • http://www.cbs.dtu.dk/services/NetStart/

81
Poly-adenylation signal prediction
  • BCM GeneFinder
  • POLYAH - Recognition of 3'-end cleavage and
    poly-adenylation region
  • Triplet frequencies in various functional parts
    in poly-adenylation regions
  • LDA
  • Solovyev et al. (1994), NAR, 22(24), 5156-5163.
  • http://genomic.sanger.ac.uk/gf/gf.shtml

82
Prediction of coding potential
  • Periodicity detection
  • Coding sequences have an inherent periodicity of
    three
  • Especially good on long coding sequences
  • Auto-correlation
  • Seeking the strongest response when shifted
    sequence is compared with original
  • Michel (1986), J. Theor. Biol. 120, 223-236.
  • Fourier transformation: spectral analysis
  • Detection of a peak at frequency 1/3, i.e. a
    period of three bases
  • Silverman and Linsker (1986), J. Theor. Biol.
    118, 295-300.
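
The period-three signal can be measured directly by evaluating the Fourier transform of the sequence at frequency 1/3. The sketch below is a generic illustration, not the method of the papers cited above: encoding the sequence as four 0/1 indicator tracks (one per base) and normalising by sequence length are assumptions.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over A, C, G, T indicator
    tracks and divided by the sequence length."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient at frequency 1/3 of the 0/1 track for `base`:
        # sum of exp(-2*pi*i*k/3) over the positions k where `base` occurs.
        coeff = sum(cmath.exp(-2j * cmath.pi * k / 3)
                    for k, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / n


periodic = period3_power("ATG" * 30)   # perfectly codon-periodic toy sequence
flat = period3_power("A" * 90)         # no period-three structure
```

In practice the measure is computed in a sliding window along the genomic sequence; longer windows give a cleaner peak, which is why the approach works best on long coding sequences.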

83
Prediction of coding potential (cont.)
  • Trifonov (1980-1987)
  • G-notG-U periodicity
  • JMB, 194, 643-652.
  • Fickett (1982)
  • Position asymmetry in the three codon positions
  • NAR 10(17), 5303-5318.
  • Staden (1984)
  • Codon usage in tables
  • NAR 12, 551-567.

84
Prediction of coding potential (cont.)
  • Claverie and Bougueleret (1987)
  • Hexamer frequency differentials
  • NAR 14, 179-196.
  • Fichant and Gautier (1987)
  • Codon usage homogeneity
  • CABIOS, 3(4), 287-295.
  • GRAIL I (1991)
  • Neural network using a shifting fixed size window
  • 7 sensors as input, 2 hidden layers and 1 unit as
    output
  • Uberbacher et al. (1991), PNAS, 88(24),
    11261-11265.
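
The hexamer-frequency idea above compares how often each k-mer occurs in known coding versus non-coding sequence. The following is a generic log-odds sketch, not the exact statistic of Claverie and Bougueleret: the pseudocount, the use of overlapping k-mers, and the function names are assumptions. The toy example uses k=3 to keep the numbers small; k=6 gives hexamers.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count overlapping k-mers across a collection of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def make_scorer(coding, noncoding, k=6, pseudocount=1.0):
    """Return a function scoring a sequence by summed k-mer log-odds
    (coding vs. non-coding); positive totals suggest coding sequence."""
    c, n = kmer_counts(coding, k), kmer_counts(noncoding, k)
    # The pseudocount keeps unseen k-mers from producing log(0).
    c_total = sum(c.values()) + pseudocount * 4 ** k
    n_total = sum(n.values()) + pseudocount * 4 ** k

    def score(seq):
        seq = seq.upper()
        total = 0.0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            total += math.log((c[kmer] + pseudocount) / c_total)
            total -= math.log((n[kmer] + pseudocount) / n_total)
        return total

    return score


score = make_scorer(["ATGATGATG"], ["AAAAAAAAA"], k=3)
coding_like = score("ATGATG")
noncoding_like = score("AAAAAA")
```

With real data the two training sets would be curated coding and intergenic or intronic sequence from the organism of interest, which is one reason such measures need organism-specific training.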

85
Prediction of coding potential (cont.)
  • GeneMark (1986)
  • Inhomogeneous Markov chain models
  • Easily trainable (closed-form solution for maximum
    likelihood)
  • Used extensively in prokaryotic genomes
  • Borodovsky et al. (1993), Computers and Chemistry,
    17, 123-133.
  • Glimmer (1998)
  • Interpolated Markov chains from first to eighth
    order
  • Salzberg et al. (1998), NAR, 26(2), 544-548.
  • http://www.tigr.org/softlab/glimmer/glimmer.html

86
Prediction of coding potential (cont.)
  • Review by Fickett (1992)
  • Assessment of protein coding measures, NAR, 20,
    6441-6450.

87
Prediction of coding exons
  • SorFind
  • Detection of spliceable ORFs
  • Hutchinson, NAR, 20(13), 3453-3462.
  • BCM GeneFinder
  • FEXD, FEXN, FEXA, FEXY, FEXH, HEXON
  • LDA
  • Solovyev et al. (1994), NAR, 22(24), 5156-5163.
  • http://genomic.sanger.ac.uk/gf/gf.shtml
  • GRAIL II
  • Exon candidates, heuristic integration, learning
    with neural network
  • Uberbacher et al., Genet. Eng., 16, 241-253.
  • http://compbio.ornl.gov/

88
Integrated gene models: LDA/QDA
  • FGene
  • LDA based
  • Dynamic programming for the integration of LDA
    output
  • Solovyev et al. (1995), ISMB, 3, 367-375.
  • http://genomic.sanger.ac.uk/gf/gf.shtml

89
Integrated gene models: NN
  • GeneParser
  • Gene-parsing approach
  • Potential alternative splicing recognized
  • Neural network and dynamic programming
  • Snyder and Stormo (1995), JMB, 248, 1-18.

90
Integrated gene models: Artificial intelligence
approaches
  • GeneID
  • Rule-based system
  • Homology integration
  • Guigó et al. (1992), JMB, 226, 141-157.
  • http://www1.imim.es/geneid.html
  • GeneID using DP
  • DP to combine a set of potential exons
  • Guigó et al. (1998), JCB, 5, 681-702.

91
Integrated gene models: Artificial intelligence
approaches
  • GenLang
  • Syntactic pattern recognition system
  • Formal grammar
  • Tools from computational linguistics
  • Dong and Searls (1994), Genomics, 23, 540-551.
  • http://cbil.humgen.upenn.edu/sdong/genlang_home.html

92
Integrated gene models: HMMs
  • HMMGene
  • Several genes per sequence possible
  • User constraints possible
  • Krogh (1997), ISMB, 5, 179-186.
  • http://www.cbs.dtu.dk/services/HMMgene/
  • GeneMark.hmm
  • Based on GeneMark program for bacterial sequences
  • Can predict frame shifts
  • Trained for various organisms
  • Lukashin and Borodovsky (1998), NAR, 26,
    1107-1115.
  • http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html

93
Integrated gene models: GHMMs
  • Genie
  • Generalized hidden Markov model with length
    distribution
  • Integration of multiple content and signal
    sensors
  • Content: codon statistics, repeats, intron,
    intergenic, database homology hits
  • Signal: promoter, start codon, splice sites, stop
    codon
  • Dynamic programming to find optimal parse
  • Several genes per sequence possible
  • Kulp et al. (1996), ISMB, 4, 134-142.
  • Reese et al. (1997), JCB, 4(3), 311-323.
  • http://www.cse.ucsc.edu/dkulp/cgi-bin/genie

94
Example: Genie
95
Integrated gene models: GHMMs
  • GenScan
  • Multiple content and signal models
  • Semi-hidden Markov model sensors with length
    distribution
  • Takes GC content into account (separate models)
  • Several genes per sequence possible
  • Burge and Karlin (1997), JMB, 268(1), 78-94.
  • http://CCR-081.mit.edu/GENSCAN.html

96
EST/cDNA alignment for gene finding: Spliced
alignments
  • PROCRUSTES
  • Spliced alignment algorithm
  • Dynamic programming to combine a set of potential
    exons
  • Frame conservation
  • Homologous sequence needed
  • Gelfand et al. (1996), PNAS, 93, 9061-9066.
  • http://hto-13.usc.edu/software/procrustes/

97
EST/cDNA alignment
  • Sim4
  • Aligns cDNA to genomic sequence
  • Uses local similarity
  • Florea et al. (1998), Genome Research, 8,
    967-974.
  • GeneWise
  • Dynamic programming
  • Partial genes allowed
  • Based on Pfam and statistical splice site models
  • Birney (1999), unpublished
  • http://www.sanger.ac.uk/Software/Wise2

98
EST/cDNA alignment (cont.)
  • ACEMBLY
  • Aligns ESTs to genomic sequence
  • Identifies alternative splicing
  • Integrated in ACeDB
  • Jean Thierry-Mieg (unpublished)

99
Repeat finders
  • Censor
  • Uses database of repeat sequences
  • Jurka et al. (1996), Comp. and Chem., 20(1),
    119-122.
  • BLAST
  • Integrated masking operations
  • XBLAST procedure
  • Claverie (1994), In Automated DNA Sequencing and
    Analysis Techniques, M. D. Adams, C. Fields and
    J. C. Venter, eds., 267-279.
  • http://www.ncbi.nlm.nih.gov/BLAST

100
Repeat finders (cont.)
  • RepeatMasker
  • Detection of interspersed repeats
  • Smit and Green, unpublished results
  • http://ftp.genome.washington.edu/RM/RepeatMasker.html

101
Homology searching
  • BLAST suite
  • BLASTN, BLASTX, TBLASTX, PSI-BLAST
  • Altschul et al. (1990), JMB, 215, 403-410.
  • http://www.ncbi.nlm.nih.gov/BLAST
  • FASTA suite
  • FASTA, TFASTA
  • Pearson and Lipman (1988), PNAS, 85, 2444-2448.
  • HMM-based searching
  • SAM (UCSC group)
  • http://www.cse.ucsc.edu/research/compbio/sam.html
  • HMMER, Sean Eddy
  • http://hmmer.wustl.edu/

102
Gene family searching
  • BLOCKS
  • http://www.blocks.fhcrc.org
  • PROSITE
  • http://www.expasy.ch/prosite/
  • PFAM
  • http://pfam.wustl.edu/
  • SCOP
  • http://scop.mrc-lmb.cam.ac.uk/scop/

103
The genome annotation experiment (GASP1)
  • Genome Annotation Assessment Project (GASP1)
  • Annotation of 2.9 Mb of Drosophila melanogaster
    genomic DNA
  • Open to everybody, announced on several mailing
    lists
  • Participants can use any analysis methods they
    like (gene finding programs, homology searches,
    by-eye assessment, combination methods, etc.) and
    should disclose their methods.
  • CASP-like
  • 12 participating groups

104
URL: http://www.fruitfly.org/GASP1
105
Goals of the experiment
  • Compare and contrast various genome annotation
    methods
  • Objective assessment of the state of the art in
    gene finding and functional site prediction
  • Identify outstanding problems in computational
    methods for the annotation process

106
Adh contig
  • 2.9 Mb contiguous Drosophila sequence from the
    Adh region, one of the best studied genomic
    regions
  • From chromosome 2L (34D-36A)
  • Ashburner et al. (to appear in Genetics)
  • 222 gene annotations (as of July 22, 1999)
  • 375,585 bases are coding (12.95%)
  • We chose the Adh region because it was thought to
    be typical: a representative test bed for
    evaluating annotation techniques.
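
The coding percentage on this slide can be checked with a one-line calculation. The sketch below assumes the rounded contig length of 2.9 Mb quoted on the slide; the exact contig length is not given here, so the result is only approximate.

```python
def percent_coding(coding_bases, total_bases):
    """Percentage of the sequence annotated as coding."""
    return 100.0 * coding_bases / total_bases


# 375,585 coding bases over an assumed 2,900,000 bp contig (rounded figure).
print(round(percent_coding(375_585, 2_900_000), 2))  # 12.95
```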

107
Adh paper (to appear in Genetics)
URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
108
Raw sequence: Adh.fa
  • GAATTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGAC
    GTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCT
    TCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTG
    CCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGC
    CAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTG
    GTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTA
    CTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAA
    CTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGG
    CCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAATT
    CCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATAC
    TGAGCCCAAATGAGCGATAGATAGATAGATCGTGCGGCGATCTCGTACTG
    GTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTC
    TGGTTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTC
    TCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTG
    GGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAAC
    GATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTG
    CCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAG
    CTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGG
    ACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGT
    TTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGC
    CGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTAAAGTAACCTGCGGGAA
    TTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAAT
    ACTGAGCCCAAATGAGCGATAGATAGATAGATCGTGCGGCGATCTCGTAC
    TGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTT
    TCTGGTTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGT
    TCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTG
    TGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAA
    ACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACT
    TGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGC
    AGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATG
    GGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCT
    GTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTT
    GCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCC
    CTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCG
    GCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGT
    ATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCAT
    CGAAAAGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTC
    TGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTGGG
    CGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGA
    TTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCC
    CTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCT
    GGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGAC
    CTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTT
    GACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGG
    TTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGG
    ACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTC
    CTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGT
    TGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACG
    GCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTG
    TGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCG
    TACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACA
    AACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGT
    GGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAA
    TTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAAT
    ACTGTGCGGCGATCTCGTACTGGACGGAAATGTCAGGAGATAGGAGAAGA
    AAA

109
Drosophila data sets provided to participants
  • Curated Drosophila nuclear DNA "coding sequences"
    (CDS)
  • Curated non-redundant Drosophila genomic DNA data
    (275 multi- and 144 single-exon sequence
    entries from Genbank)
  • Drosophila 5' and 3' splice sites
  • Drosophila start codon sites
  • Drosophila promoter sequences
  • Drosophila repeat sequences
  • Drosophila transposon sequences
  • Drosophila cDNA sequences
  • Drosophila EST sequences

URL: http://www.fruitfly.org/GASP1/data/data.html
110
Timetable
  • May 13, 1999 - June 30, 1999
  • Distribution of the sample sequence and
    associated data to the predictors. Collection of
    predictions.
  • June 30, 1999 - July 31, 1999
  • Evaluation of the predictions by the Drosophila
    Genome Center.
  • August 4, 1999
  • External expert assessment of the prediction
    results (HUGO meeting, EMBL)
  • August 6, 1999
  • Tutorial 3 at the ISMB 99 conference in
    Heidelberg, Germany

111
Resources for assessing predictions
  • 80 cDNA sequences NOT in Genbank before
    experiment deadline
  • Sequenced from 5 different cDNA libraries
  • 3 paralogs to other genes in the genome
  • 19 cDNAs with cloning artifacts
  • 2 apparently representing unspliced RNA
  • Multiple inserts (2 cDNAs cloned in the same
    vector)
  • 58 usable cDNAs
  • 33 cDNA sequences in Genbank during experiment
  • Annotations from Adh paper

112
Curated data sets for assessing predictions
  • Standard 1 (Adh.std1.gff): conservative gene set
  • 43 gene structures (7 single- and 36 multi-
    coding exon genes)
  • Criteria for inclusion
  • >95% (most >99%) of the cDNA aligned to genomic
    DNA (using sim4)
  • GT/AG splice site consensus sequences
  • Splice site score from neural net
  • 5' splice sites >0.35 threshold (98% True
    Positive score)
  • 3' splice sites >0.25 threshold (92% True
    Positive score)
  • Start codon and stop codon annotations from
    Standard 3 (derived from Adh paper)
  • These 43 genes represent typical genes
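
The inclusion criteria for Standard 1 can be written down as a small filter. This is a sketch under stated assumptions, not the procedure actually used by the Drosophila Genome Center: the `Intron` record and its field names are hypothetical, the splice site scores are assumed to come from an external neural network, and the thresholds are read as >95% aligned, >0.35 for 5' sites and >0.25 for 3' sites.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Intron:
    donor_dinucleotide: str     # first two intron bases, expected "GT"
    acceptor_dinucleotide: str  # last two intron bases, expected "AG"
    donor_score: float          # 5' splice site score (assumed neural net output)
    acceptor_score: float       # 3' splice site score (assumed neural net output)


def passes_standard1(aligned_fraction: float, introns: List[Intron],
                     min_aligned: float = 0.95,
                     min_donor: float = 0.35,
                     min_acceptor: float = 0.25) -> bool:
    """Apply the alignment, consensus and score criteria to one gene."""
    if aligned_fraction <= min_aligned:
        return False
    for intron in introns:
        if intron.donor_dinucleotide.upper() != "GT":
            return False
        if intron.acceptor_dinucleotide.upper() != "AG":
            return False
        if intron.donor_score <= min_donor:
            return False
        if intron.acceptor_score <= min_acceptor:
            return False
    return True  # single-exon genes have no introns to check


print(passes_standard1(0.99, [Intron("GT", "AG", 0.90, 0.80)]))
print(passes_standard1(0.99, [Intron("GC", "AG", 0.90, 0.80)]))
```

Dropping the two consensus checks gives the looser rule described for Standard 2.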

113
Curated data sets for assessing predictions
  • Standard 2 (Adh.std2.gff)
  • Superset of Standard 1
  • 15 additional gene structures
  • Same alignment criteria as Standard 1 but no
    splice site consensus requirement
  • Not used in the experiment

114
Curated data sets for assessment
  • Standard 3 (Adh.std3.gff): more complete gene
    set
  • 222 gene structures (39 single- and 183 multi-
    coding exon genes)
  • Criteria
  • Annotated as described in Ashburner et al.
  • cDNA to genomic alignment using sim4
  • Start codons predicted by ORFFinder (Frise et
    al., unpublished)
  • 182 genes have similarity to a homologous
    protein sequence in another organism or have a
    Drosophila EST hit
  • Edge verification by partial EST/cDNA alignments
  • BLASTX, TBLASTX homology results
  • PFAM alignments
  • Gene structure verification using GenScan (human)
  • 14 genes had EST/homology hits but no gene
    finding predictions
  • 40 genes only have strong GenScan predictions

115
Submission format
  • GFF (Durbin and Haussler, 1998, unpublished)
  • http://www.sanger.ac.uk/Software/GFF/

116
Sample submission
organism Drosophila melanogaster
std1
Gene 1
Adh  std1  TFBS             32002  32006  .  +  .
Adh  std1  TATA_signal      32009  32012  .  +  .  transcript "1"
Adh  std1  TSS              32033  32034  .  +  .  transcript "1"
Adh  std1  prim_transcript  32034  33122  .  +  .  transcript "1"
Adh  std1  exon             32034  32277  .  +  .  transcript "1"
Adh  std1  start_codon      32122  32124  .  +  .  transcript "1"
Adh  std1  CDS              32122  32277  .  +  .  transcript "1"
Adh  std1  splice5          32277  32278  .  +  .  transcript "1"
Adh  std1  splice3          32332  32333  .  +  .  transcript "1"
Adh  std1  exon             32785  32830  .  +  .  transcript "1"
Adh  std1  CDS              32785  32830  .  +  .  transcript "1"
Adh  std1  splice5          32830  32831  .  +  .  transcript "1"
Adh  std1  splice3          32825  32826  .  +  .  transcript "1"
Adh  std1  CDS              32826  33003  .  +  .  transcript "1"
Adh  std1  exon             32826  33122  .  +  .  transcript "1"
Adh  std1  stop_codon       33001  33003  .  +  .  transcript "1"
Adh  std1  polyA_signal     33090  33095  .  +  .  transcript "1"
Adh  std1  polyA_site       33101  33102  .  +  .  transcript "1"
Gene 2
Adh  std1  prim_transcript  38100  41973  .  -  .  transcript "2"
Adh  std1  exon             38100  41973  .  -  .  transcript "2"
Adh  std1  polyA_site       39620  39621  .  -  .  transcript "2"
Adh  std1  polyA_signal     39685  39690  .  -  .  transcript "2"
Adh  std1  stop_codon       40125  40127  .  -  .  transcript "2"
Adh  std1  CDS              40125  40390  .  -  .  transcript "2"
Adh  std1  start_codon      40388  40390  .  -  .  transcript "2"
Adh  std1  TSS              41973  41974  .  -  .  transcript "2"
Adh  std1  TATA_signal      41998  42001  .  -  .  transcript "2"
Adh  std1  TFBS             42187  42193  .  -  .
Adh  std1  TFBS             42211  42216  .  -  .
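
Records like those in the sample submission can be read with a few lines of Python. The sketch below is an illustration, not a full GFF parser: it assumes the eight standard columns (seqname, source, feature, start, end, score, strand, frame) followed by an optional group field, splits on whitespace rather than tabs, and writes the forward strand as "+" on the assumption that this character was lost from the transcript.

```python
from collections import defaultdict


def parse_gff_line(line):
    """Split one whitespace-delimited GFF record into a dict."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}: {line!r}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = " ".join(fields[8:]) or None  # e.g. 'transcript "1"', or None
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "frame": frame, "group": group,
    }


def features_by_transcript(lines):
    """Group parsed records by their group field, skipping blanks and comments."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        record = parse_gff_line(line)
        grouped[record["group"]].append(record)
    return grouped


SAMPLE = """\
Adh std1 exon 32034 32277 . + . transcript "1"
Adh std1 CDS 32122 32277 . + . transcript "1"
Adh std1 exon 38100 41973 . - . transcript "2"
Adh std1 TFBS 42211 42216 . - .
"""

for group, records in features_by_transcript(SAMPLE.splitlines()).items():
    print(group, [(r["feature"], r["start"], r["end"]) for r in records])
```

Features without a group, such as the TFBS records, end up under the `None` key.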
117
Submissions
  • MAGPIE Team
  • Credit
  • Terry Gaasterland, Alexander Sczyrba, Elizabeth
    Thomas, Gulriz Kurban, Paul Gordon, Christoph
    Sensen
  • Laboratory for Computational Genomics,
    Rockefeller and Institute for Marine Biosciences,
    Canada
  • Method
  • Automatic genome analysis system integrating
    Drosophila Genscan predictions, confirming exon
    boundaries using database searches, repeat
    finding (Calypso, REPuter) and gene function
    annotations.

118
Submissions (cont.)
  • References
  • Multigenome MAGPIE poster at ISMB 99.
  • Gaasterland and Ragan (1998), J. of Microbial and
    Comparative Genomics, 3, 305-312.
  • Gaasterland and Sensen (1996), Biochimie 78,
    302-310.
  • REPuter: Kurtz and Schleiermacher (1999),
    Bioinformatics 15(5), 426-427.

119
Submissions (cont.)
  • Computational Genomics Group, The Sanger Centre
  • Credit
  • Victor Solovyev, Asaf Salamov
  • Method
  • Discriminant analysis based gene prediction
    programs: FGenes (trained for human) and FGenesH
    (trained for Drosophila). The output of FGenes,
    FGenesH and BLAST is combined using FGenesH.
    Annotations at 3 different thresholds were
    submitted.
  • The program's running time is linear in the
    sequence length.
  • Automatic, plus additional user interactive
    screening.
  • Non-redundant NCBI database used for BLAST.
  • URL/References
  • http://genomic.sanger.ac.uk/gf/gf.shtml

120
Submissions (cont.)
  • Genome Annotation Group, The Sanger Centre
  • Credit
  • Ewan Birney
  • Method
  • Protein family based gene identification using
    Wise2 (previously Genewise) and PFAM.
  • URL
  • http://www.sanger.ac.uk/Software/Wise2

121
Submissions (cont.)
  • Pattern Recognition, The University of Erlangen
  • Credit
  • Uwe Ohler, Georg Stemmer, Stefan Harbeck,
    Heinrich Niemann
  • Method
  • Promoter recognition based on interpolated Markov
    chains: Genscan-like promoter model
    (MCPromoter); maximal mutual information based
    estimation of interpolated Markov chains.
  • Automatic.
  • Promoter training data set from
    http://www.fruitfly.org/data/genesets

122
Submissions (cont.)
  • References
  • Ohler, Harbeck, Niemann, Noeth and Reese (1999),
    Bioinformatics 15(5), 362-369.
  • Ohler, Harbeck and Niemann (1999), Proc.
    EUROSPEECH, to appear.
  • URL
  • http://www5.informatik.uni-erlangen/HTML/English/Research/Promoter

123
Submissions (cont.)
  • Computational Biosciences, Oak Ridge National
    Laboratory
  • Credit
  • Richard J. Mural, Douglas Hyatt, Frank Larimer,
    Manesh Shah, Morey Parang
  • Method
  • Integrated neural network based system including
    gene assembly using EST and homology information
    (GRAILexp).
  • URL
  • http://compbio.ornl.gov/droso

124
Submissions (cont.)
  • Center for Biological Sequence Analysis,
    Technical University of Denmark
  • Credit
  • Anders Krogh
  • Method
  • Modular HMM incorporating database hits (proteins
    and ESTs/cDNAs) and other external information
    probabilistically (HMMGene). The HMM has modules
    for coding regions, splice sites, translation
    start/stop, etc.
  • It will be a fully automated system.
  • Trained on Drosophila data
  • http://www.fruitfly.org/GSAC1/data/data.html
  • and
  • Victor Solovyev (personal communication)

125
Submissions (cont.)
  • References
  • Krogh (1998), In S.L. Salzberg et al., eds.,
    Computational Methods in Molecular Biology,
    45-63, Elsevier.
  • Krogh (1997), Gaasterland et al., eds., Proc.
    ISMB 97, 179-186.
  • http://www.cbs.dtu.dk/krogh/refs.html
  • URL
  • http://www.cbs.dtu.dk/services/HMMgene/
  • Not yet for Drosophila.

126
Submissions (cont.)
  • BLOCKS group, Fred Hutchinson Cancer Research
    Center in Seattle, Washington
  • Credit
  • Jorja Henikoff, Steve Henikoff
  • Method
  • DNA translation in 6 frames and search against
    BLOCKS and against BLOCKS extracted from
    Smart3.0 (http://coot-embl-heidelberg.de/SMART/)
    using BLIMPS. Automatic post-processing to join
    multiple predictions from the same block.
  • Automatic with some user interactive screening of
    results.
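
The first step of this method, translating genomic DNA in all six reading frames, can be sketched as follows. The sketch covers only the translation step and assumes the standard genetic code. The search against BLOCKS with BLIMPS is not reproduced here, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper- or lower-case DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six protein strings would then be searched against the block database, and hits mapped back to genomic coordinates using the frame label and offset.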

127
Submissions (cont.)
  • References
  • Henikoff, Henikoff and Pietrokovski (1999), Nucl.
    Acids Res., 27, 226-228.
  • Henikoff and Henikoff (1994), Proc. 27th Ann.
    Hawaii Intl. Conf. On System Sciences, 265-274.
  • Henikoff and Henikoff (1994), Genomics, 19,
    97-107.
  • URL
  • http://blocks.fhcrc.org
  • http://blocks.fhcrc.org/blocks-bin/getblock.sh?<block name>

128
Submissions (cont.)
  • Genome Informatics Team, IMIM, Barcelona, Spain
  • Credit
  • Roderic Guigó, Josep F. Abril, Enrique Blanco,
    Moises Burset, Genis Parra
  • Method
  • Dynamic programming based system to combine
    potential exon candidates modeled as a fifth
    order Markov model and functional sequence sites
    modeled as a position weight matrix (Geneid
    version 3).
  • Fully automatic, very fast.
  • Trained on Drosophila data
  • http://www.fruitfly.org/GSAC1/data/data.html
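
The position weight matrix mentioned above for functional sequence sites can be illustrated with a minimal log-odds sketch. It is not Geneid's implementation: the training sites below are made-up donor-like strings, and the uniform background (0.25 per base) and pseudocount of 1 are assumptions chosen for simplicity.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({base: math.log2((column.count(base) + pseudocount)
                                    / total / background)
                    for base in "ACGT"})
    return pwm


def score_site(pwm, seq):
    """Sum the per-position log-odds scores (seq must contain only A, C, G, T)."""
    return sum(column[base] for column, base in zip(pwm, seq.upper()))


def best_hit(pwm, seq):
    """Return (score, offset) of the best-scoring window in seq."""
    width = len(pwm)
    return max((score_site(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Hypothetical aligned training sites, for illustration only.
training_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(training_sites)
print(best_hit(pwm, "CCCGTAAGTCCC"))
```

In a gene finder, site scores of this kind are combined with exon scores (here, from the fifth-order Markov model) before candidate exons are assembled into gene structures.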

129
Submissions (cont.)
  • References
  • Guigó et al. (1998), JCB, 5, 681-702.
  • URL
  • Information on training process
  • http://www1.imim.es/rguigo/AnnotationExperiment/index.html
  • http://www1.imim.es/geneid.html

130
Submissions (cont.)
  • Mark Borodovsky's Lab, School of Biology, Georgia
    Institute of Technology
  • Credit
  • Mark Borodovsky, John Besemer
  • Method
  • Markov chain models combined with HMM technology
    (Genemark.hmm).
  • URL
  • http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html

131
Submissions (cont.)
  • Biodivision, GSF Forschungszentrum für Umwelt und
    Gesundheit, Neuherberg, Germany
  • Credit
  • Matthias Scherf, Andreas Klingenhoff, Thomas
    Werner
  • Method
  • Universal sequence classifier which is based on a
    correlated word analysis to predict initiators
    and promoter associated TATA boxes (CoreInspector
    V1.0 beta). Sequences of 100 bp are classified at
    once.
  • Trained on Eukaryotic Promoter Database (EPD
    version 5.9).
  • Fully automatic, 2 seconds per 1 kb.
  • References
  • Scherf et al. (1999), in prepa