Computational Gene Finding presentation

About This Presentation

Transcript and Presenter's Notes

Title: Computational Gene Finding

1
Computational Gene Finding
CIS786 Intro to Comp Biol Instructor Dr. Barry
Cohen

Greg Voronin
Hui Zhao
Xueyi(Judy) Xiao

2
The Challenge
Presented By Greg Voronin

Generate predictions of gene locations from
primary genomic sequence by computational means
Two principle means
Database searching
Statistical Methods

3
The Biological Model
4
The Computational Model

Representing the biology in a framework amenable
to mathematical/statistical methods
Exon classification, sequence features, signal
profiles
What is an exon and what properties does the
sequence of an exon hold?
How is an exon recognized and processed?

5
Exon Classification Scheme
6
The Nature of The Data

What is the primary genomic sequence?
Nor is the available sequence a single
continuous and exact sequence for each
chromosome the HGP is represented by a set
of sequences that cover the genome is a
statistical sense but have a very large number of
gaps.
Many genes are as large or larger than the
contigs in the HGP
Finding genes will depend on the accuracy of the
scaffold of their contigs

7
Back to Beginning

What is a gene?
A biological model, a mathematical model and
computational representation

The programs we evaluate take these factors into
account in their underlying model
8
MZEF

Michael Zhangs Exon Finder
Utilizes quadratic discriminant analysis (QDA) to
classify sequence into gene and non-gene groups
QDA is a multivariate statistical pattern
recognition method
Draws a curved boundary between groups of
different classes

9
QDA
10
Key Elements of QDA

Entities are represented by an n-dimensional
vector of feature values
Two classes of entities are categorized by their
respective multinormal distribution
Each class has its own mean vector
The mean of each feature
An appropriate distance function is central to
the calculation of the posterior probabillity of
group membership of a given unknown entity given
its specific feature vector.

11
Mahalanobis Distance

The actual posterior probabillity function is
more complex, but this is the distance component

( x mi )T Si-1 ( x mi )
12
MZEF Specifics

MZEF uses the following features
Exon length, exon-intron transition, branch site
score, 3ss score, exon score, strand score,
frame score, 5ss score, intron-exon transition
9 dimensional feature vector
Training sets of known exons and non-exons are
used to establish the class characterisitics
Supervised learning

13
GATC to Gene

Cells recognize genes from DNA sequence.

Can we??
The Hidden Markov Model Method
HMMgene Presented By Hui Zhao
14
HMMs are Statistical Models

Definition
Any mathematical construct that attempts to
parameterize a random process
Example A normal distribution
Assumptions
Parameters
Estimation
Usage
HMMs are just a little more complicated

15
Primary HMM Assumptions

Observations are ordered
Random processes can be represented by a
stochastic finite state machine with emitting
states
transition probabilities and emission
probabilities.

16
How do we find the model probabilities?

This is called training
We start with an architecture and a set of
observed sequences
The training process iteratively alters its
parameters to fit the training set
The trained model will assign the training
sequences high probability
but can it generalize?

17
HMM Usage two major tasks

Evaluate the probability of an observed sequence
given the model (Forward)
Find the most likely path through the model for a
given observation sequence (Viterbi)

18
Gene Finding An Ideal HMM Application

Our Objective
To find the coding and non-coding regions of an
unlabeled string of DNA nucleotides
Our Motivation
Assist in the annotation of genomic data produced
by genome sequencing methods
Gain insight into the mechanisms involved in
transcription, splicing and other processes

19
Why HMMs might be a good fit for Gene Finding

The observations within a sequence are ordered
A DNA sequence is a set of ordered observations
Designing the architecture is straight forward
Easy to measure success
Training data is available from various genome
annotation projects

20
A HMM genefinder

States represent standard gene features
intergenic region, exon, intron, perhaps more
(promotor, 5UTR, 3UTR, Poly-A,..).
Observations are things like state-dependent
base composition.
In a HMM, length of each state must be included
as well.
Finally, reading frames and both strands must be
dealt with.

21
Several problems can occur
5
correct gene structure extended exon missing
exon additional exon missing intron extended gene
model
22
HMMgene

Krogh (1997) In Proc. 5th Conf. Intel. Sys. Mol
Biol. pp179-186

23
HMMGene

Uses an extended HMM called a CHMM
CHMM HMM with classes
Takes full advantage of being able to modify the
statistical algorithms
Uses high-order states
Trains everything at once

24
How does HMMGene work?
1) 5th order HMM assumes P(xi xi-1,xi-2,
xi-3, xi-4, xi-5) is different in Introns, Exons,
etc..
2) Construct the model
25
2. How does HMMGene work?
4) Use Viterbi (n-best) to find a path
through the CHMM a labeled gene
5) Use the forward algorithm to measure P(gene
model) using n-best.
26
A DNA sequence containing one gene. For each
nucleotide its label is written below. The coding
regions are labeled C, the introns I, and the
intergenic regions 0. HMMGene calls these
class labels in a CHMM.
27
HMMGene

Does not use the standard ML method which
optimizes the probability of the observed
sequence instead it maximizes the probability
of the correct prediction.
Only one conference paper describes the
algorithm. There is a web site to run the
algorithm, and it's performance has been compared
to other algorithms.
No complete description of the algorithm is
available in the 1997 paper the author states
" the details of HMMGene will be described
elsewhere (in prep)" but unfortunately the
detailed paper has not been published.

28

HMMgene http//www.cbs.dtu.dk/services/HMMgene/)
29
HMMgene and HMM Disadvantages

Markov Chains
States should be independent
P(y) must be independent of P(x) -usually not
true
Local maxima
Model may not converge the optimal parameter set
Over-fitting
More training is not always good-set may be too
small

30
Summary

HMMgene finds whole genes in anonymous DNA with
correctly spliced exons.
It can predict several whole or partial genes in
one sequence.
If some features of a sequence are known, such as
hits to ESTs, proteins, or repeat elements, these
regions can be locked as coding or non-coding and
then the program will find the best gene
structure under these constraints.

31
GENSCAN (v1.0)
Presented By Xueyi (Judy) Xiao

A computer program identifying complete exon
intron structures of genes in genomic DNA.
Developed by Chris Burge (Burge 1997), in the
research group of Samuel Karlin, Dept of
Mathematics, Stanford Univ.
Original server _at_Stanford ? New server _at_MIT
(seq_len lt 500 kb)
Servers are also maintained by the Pasteur
Institute, Paris and by the GENSCAN web server at
DKFZ/EMBnet, Heidelberg
Implementations
web server http//genes.mit.edu/GENSCAN.html
email server http//genes.mit.edu/GENSCANM.html
local copy downloaded under a license agreement

32
How does It Work?

Designed to predict complete gene structures
Introns and exons
Promoter sites
Polyadenylation signals
Larger predictive scope
Partial and Complete genes
Multiple genes separated by intergenic DNA in a
seq
Consistent sets of genes on either/both DNA
strands
Not use similarity-based methods
Based on a general probabilistic model of genomic
sequences composition and gene structure

33
Model of Genomic Sequence Structure
Fig. 3, Burge and Karlin 1997
34
Input
http//genes.mit.edu/GENSCAN.html
35
Output
36
Graphic View
Optimal Exon Suboptimal Exon
Initial Exon
Internal Exon
Terminal Exon
Single-Exon gene
37
Is It Good?

Accuracy
Substantially higher accuracies when tested on
standardized sets of human vertebrate genes,
with 75-80 of exons identified exactly.
Reliability
Able to indicate fairly accurately the
reliability of each predicted exon.
Consistency
Consistently high levels of accuracy, for seqs
of differing CG content and for distinct groups
of vertebrates.

38
Why not Perfect?

Gene Number
usually approximately correct, but may not
Organism
primarily for human/vertebrate seqs maybe lower
accuracy for non-vertebrates. Glimmer
GeneMark for prokaryotic or yeast seqs
Exon and Feature Type
Internal exons gt Initial or Terminal exons
Exons gt Polyadenylation or Promoter
signals(NNPP)
Biases in Test Set
The Burset/Guigó (1996) dataset
toward short genes with relatively simple
exon/intron structure
The Rogic (2001) dataset
DNA seqs GenBank r-111.0 (04/1999 lt- 08/1997)
source organism specified
consider genomic seqs containing exactly one
gene
seqsgt200kb were discarded mRNA seqs and seqs
containing pseudo genes or alternatively spliced
genes were excluded.

39
What are They doing NOW?

The research group _at_MIT
is currently developing another program,
GenomeScan, which is more accurate
when a moderate or closely related
protein seq is available.

40
(No Transcript)
41
TEST OF METHODS

Sample Tests reported by Literature
Test on the set of 570 vertebrate gene seqs
(BursetGuigo 1996) as a standard for comparison
of gene finding methods.
Test on the set of 195 seqs of human, mouse or
rat origin (named HMR195) (Rogic 2001).
Self-Test done by our group
Dataset Intron-less(Single-exon),
-rich(Multi-exon), -poor(Random)
Organism Human
Methods all of the three
Steps

42
Where to get the dataset for Self-Test?
http//www.ncbi.nlm.nih.gov/genome/guide/human/
43
Accuracy Measures
Sensitivity vs. Specificity (adapted from
BursetGuigo 1996)
44
Results Accuracy Statistics

Table Relative Performance (adapted added from
Rogic 2001)

of seqs - number of seqs effectively analyzed
by each program in parentheses is the number of
seqs where the absence of gene was predicted Sn
-nucleotide level sensitivity Sp - nucleotide
level specificity CC - correlation coefficient
ESn - exon level sensitivity ESp - exon level
specificity
45
Testing Random Sequences
Presented By Greg Voronin

These gene finding programs model statistical
trends and properties
Can they be fooled by random sequences
Generate a preliminary measure of accuracy
Java program written to generate random
sequences of a,t,g,c
3 groups of sequences 5k, 10k 30K
Sent to BLAST then GeneMachine

46
Testing Results

BLAST
bit score E-value
5k 42 5.7
10k 44 3.0
30k 42 8.7
GeneMachine
5k 10k 30K
MZEF 1 5 14
GenScan 3 11 26
HMMgene 7 11 42

47
New directions
Presented By Hui Zhao

Computational Gene Finding has rapidly evolved
since it started 20 years ago.
The advent of full-length genomic sequences has
provided data and increased the requirements.
Gene annotation has direct medical implications
on the design of pharmaceuticals and the
understanding of the genetic component of
diseases.
Gene finding remains largely an unsolved problem.

48
New directions

The growing quantities of training data for the
models should improve their performance.
Algorithms that combine the inputs from several
models in a weighted voting scheme should be
considered to try to get the best from all of the
methods.
Many other AI approaches can be used to meet this
challenge including decision trees, neural
networks and rule-based systems

49
Challenges and Discoveries Ahead

Eukaryotic gene finding continues to be an active
and important area more research is required
into algorithms with greater accuracy
Expertise in computational biology is also
required which means training in both computer
science and molecular biology
More classes like this

Write a Comment

User Comments (0)

About PowerShow.com

Computational Gene Finding PowerPoint PPT Presentation