Hidden Markov Models and Gene Finding - PowerPoint PPT Presentation

1
Hidden Markov Models and Gene Finding
  • Temidayo Ajayi
  • Electrical Engineering and Computer Science
  • dajayi_at_ku.edu

2
Brief Overview
  • Today's goals
  • Introduce the concept of Hidden Markov Models as
    a general tool used in bioinformatics
  • Demonstrate its application in gene finding by
    reviewing two literature articles

3
Learning Objectives
  • At the end of my talk you should have
  • A knowledge of the terms used in this area of
    Bioinformatics and Machine Learning
  • A general understanding of HMM
  • A knowledge of what Gene Finding is
  • An introduction to two kinds of Gene Finders

4
Outline
  • Introduction of terminology
  • Brief intro of HMM
  • What is it?
  • How is it used?
  • Advantages and Disadvantages
  • Approaches to Gene Finding
  • Application of HMM to Gene Finding

5
Outline (contd)
  • What is Gene Finding?
  • Why is it studied?
  • Analysis of these approaches
  • Sample execution of a gene finder program
  • Conclusion
  • Questions / Discussion

6
Explanation of Terminology
  • Base pair (bp): an A-T or G-C pair in the DNA of an
    organism
  • Introns: non-coding regions within a gene
  • Exons: coding regions within a gene
  • Open Reading Frame (ORF): a stretch of DNA that
    could code for protein
  • Start and stop codons: mark the beginning and end
    of an ORF
  • GC content: usually expressed as a percentage,
    it is the proportion of GC-base pairs in the DNA
    molecule or genome sequence being investigated
  • Donor: exon-intron boundary (EI), at the 5' end of
    an intron
  • Acceptor: intron-exon boundary (IE), at the 3' end
    of an intron
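
Two of the terms above, GC content and ORF, can be illustrated with a short Python sketch. The function names and the `min_codons` threshold are assumptions made for illustration, and the scan covers only the forward strand; it is not taken from any particular gene finder.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Yield (start, end) indices of ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon. `min_codons`
    counts the codons before the stop (a hypothetical threshold).
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3)
                start = None


if __name__ == "__main__":
    dna = "CCATGAAATAGCC"
    print(gc_content(dna))
    print(list(find_orfs(dna)))
```

A real gene finder would also scan the reverse complement and would not treat every ORF as a gene; in eukaryotes, introns interrupt the coding sequence.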

7
Explanation of Terms (contd)
  • Splice sites: acceptors and donors

8
Explanation of Terms (contd)
  • Intergenic region: a stretch of DNA located
    between genes or clusters of genes. Such regions
    make up a large percentage of the human genome
    but contain few or no genes.

9
The Hidden Markov Model (HMM)
  • A finite set of states, each of which is
    associated with a (generally multidimensional)
    probability distribution.
  • Transitions among the states are governed by a
    set of probabilities called transition
    probabilities.
  • In a particular state an outcome or observation
    can be generated, according to the associated
    probability distribution.
  • Only the outcome, not the state, is visible to
    an external observer. The states are therefore
    hidden from the outside, hence the name Hidden
    Markov Model
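
The definition above can be made concrete with a toy two-state HMM in Python. The state names and all probabilities below are invented for illustration and are not taken from GENSCAN or ExonHunter. The sketch generates a sequence in which the observer would see only the bases, not the states.

```python
import random

# Toy model: every number here is an illustrative assumption.
STATES = ("coding", "noncoding")
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}
EMIT_P = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def sample(length, seed=0):
    """Generate (hidden states, observed bases) from the toy HMM."""
    rng = random.Random(seed)

    def draw(dist):
        # Pick one key of `dist` with probability equal to its value.
        return rng.choices(list(dist), weights=list(dist.values()))[0]

    states, symbols = [], []
    state = draw(START_P)
    for _ in range(length):
        states.append(state)
        symbols.append(draw(EMIT_P[state]))  # emission from current state
        state = draw(TRANS_P[state])         # transition to next state
    return states, "".join(symbols)


if __name__ == "__main__":
    hidden, observed = sample(30)
    print(observed)  # what the external observer sees
    print(hidden)    # what stays hidden
```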

10
Problems to be solved by the HMM
  • Three canonical problems
  • Given the model parameters, compute the
    probability of a particular output sequence.
    Solved by the forward algorithm
  • Given the model parameters, find the most likely
    sequence of (hidden) states which could have
    generated a given output sequence. Solved by the
    Viterbi algorithm
  • Given an output sequence, find the most likely
    set of state transition and output probabilities.
    Solved by the Baum-Welch algorithm
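
The first two problems can be sketched in a few lines of Python for a toy two-state model. The states and probabilities are illustrative assumptions, not parameters from any published gene finder. `forward` sums over all state paths, while `viterbi` keeps only the best path and works in log space to avoid underflow. Baum-Welch is omitted to keep the sketch short.

```python
import math

# Toy model: every number here is an illustrative assumption.
STATES = ("coding", "noncoding")
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}
EMIT_P = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(obs):
    """Probability of the observed sequence, summed over all state paths."""
    alpha = {s: START_P[s] * EMIT_P[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: EMIT_P[s][symbol] * sum(alpha[r] * TRANS_P[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs):
    """Most likely hidden state path for the observed sequence."""
    score = {s: math.log(START_P[s]) + math.log(EMIT_P[s][obs[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS_P[r][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS_P[best][s])
                        + math.log(EMIT_P[s][symbol]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(forward("GGCA"))
    print(viterbi("GGGGAAAAAAAA"))
```

Both functions take time proportional to the sequence length times the square of the number of states, because they reuse partial results instead of enumerating every path.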

11
HMM Sample Structure
  • Model is a linear sequence of nodes
  • Squares: matches
  • Diamonds: insertions
  • Circles: deletions

12
Why HMMs might be a good fit for Gene Finding
  • Classification: classifying observations within a
    sequence
  • Order: a DNA sequence is a set of ordered
    observations
  • Grammar / Architecture: the eukaryotic cell
    structure contains needed information
  • Success measure: number of complete exons
    correctly labeled
  • Training data: available from various genome
    annotation projects

13
HMM Advantages
  • Statistical Grounding
  • HMMs have a strong mathematical structure and
    hence can form the theoretical basis for use in a
    wide range of applications
  • Modularity
  • HMMs can be combined into larger HMMs
  • Transparency of the Model
  • Assuming an architecture with a good design
  • People can read the model and make sense of it
  • The model itself can help increase understanding
    of the original data

14
HMM Advantages (contd)
  • Incorporation of Prior Knowledge
  • Incorporate prior knowledge into the architecture
  • Initialize the model close to something believed
    to be correct
  • Use prior knowledge to constrain training process

15
How does Gene Finding make use of HMM advantages?
  • Statistics
  • Many systems alter the training process to better
    suit their success measure
  • Modularity
  • Almost all systems use a combination of models,
    each individually trained for each gene region
  • Prior Knowledge
  • A fair amount of prior biological knowledge is
    built into each architecture

16
HMM Disadvantages
  • Markov Chains
  • Each state is assumed to depend only on the state
    immediately before it
  • Beyond that, the probability of an event y must
    not depend on an earlier event x, and vice versa
  • This usually is not true
  • Can get around it when relationships are local

17
HMM Disadvantages (contd)
  • Some classic Machine Learning Problems
  • Watch out for local maxima
  • Model may not converge to a truly optimal
    parameter set for a given training set
  • Speed
  • Due to exhaustive enumeration and expansion of
    all possible paths through the model

18
HMM Overview
  • Advantages
  • Mathematical Grounding
  • Modularity
  • Transparency
  • Prior Knowledge
  • Disadvantages
  • State independence
  • Local Maxima
  • Speed

19
Approaches to Gene Finding
  • Might need to look at genes we have seen before
  • Search Known Databases
  • Homology-based gene identification
  • Might need to find genes we know nothing about
    (Ab initio)
  • Use purely computational methods
  • HMM
  • Directed Acyclic Graphs
  • Weight Matrix Methods

20
Gene Finder: GENSCAN
  • "Prediction of Complete Gene Structures in Human
    Genomic DNA", Burge and Karlin
  • Introduces a general probabilistic model for the
    gene structure of human genomic sequences and
    describes its application to gene finding in
    GENSCAN
  • GENSCAN uses a three-periodic fifth-order Markov
    model of coding regions rather than using
    specialized models of particular protein motifs
    or database homology information
  • Other gene finders, e.g. Genie, also use this
    model; however, GENSCAN differs from them. How?
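
A three-periodic Markov model keeps a separate set of conditional base probabilities for each of the three codon positions. Below is a minimal Python sketch of that idea, not GENSCAN's actual implementation. The function names, the pseudocount smoothing, and the toy training data are all assumptions. The `order` parameter defaults to 5 to match the slide, but the usage example uses order 2 so that a tiny training set suffices.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_periodic_markov(coding_seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, codon position).

    Each training sequence is assumed to start at codon position 0.
    Returns a function prob(phase, context, base).
    """
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1.0

    def prob(phase, context, base):
        ctx = counts[phase].get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(prob, seq, order=5, frame=0):
    """Log-probability of seq[order:], with seq[0] at codon position `frame`."""
    return sum(
        math.log(prob((i + frame) % 3, seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )


if __name__ == "__main__":
    model = train_periodic_markov(["ATGGCC" * 20], order=2)
    query = "ATGGCC" * 5
    for frame in range(3):
        print(frame, log_likelihood(model, query, order=2, frame=frame))
```

Scoring the same stretch in each frame and comparing the results is one way such a model can suggest both whether a region is coding and in which reading frame.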

21
GENSCAN: Distinguishing Factors
  • Use of an explicitly double-stranded genomic
    sequence model in which potential genes occurring
    on both DNA strands are analyzed in a simultaneous
    and integrated fashion
  • Flexibility: the model can contain a partial gene,
    a complete gene, multiple complete or partial
    genes, or no gene at all
  • A novel (as of 1997) method, Maximum Dependence
    Decomposition, to model functional signals in DNA
    (or protein) sequences, which allows for
    dependencies between signal positions in a fairly
    natural and statistically justifiable way

22
GENSCAN: Comparison with Other Gene Finders
  • Sn: Sensitivity
  • Sp: Specificity
  • AC: Approximate Correlation
  • ME: Missing Exons
  • WE: Wrong Exons
  • GENSCAN performance data: http://genes.mit.edu/Accuracy.html
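
The measures listed above can be computed as in the following Python sketch. It assumes the definitions commonly used in gene-finder evaluations: Sp is the fraction of predicted coding bases that are truly coding, and ME and WE are fractions of exons with no overlap at all. The function names and the handling of zero denominators are simplifications introduced here.

```python
def _ratio(num, den):
    # Simplification: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def nucleotide_accuracy(tp, fp, tn, fn):
    """Sn, Sp and approximate correlation (AC) from per-base counts."""
    sn = _ratio(tp, tp + fn)  # true coding bases that were predicted
    sp = _ratio(tp, tp + fp)  # predicted coding bases that are correct
    acp = 0.25 * (sn + sp + _ratio(tn, tn + fp) + _ratio(tn, tn + fn))
    return {"Sn": sn, "Sp": sp, "AC": 2.0 * (acp - 0.5)}


def _overlaps(a, b):
    # Exons are (start, end) pairs with an exclusive end.
    return a[0] < b[1] and b[0] < a[1]


def exon_errors(predicted, annotated):
    """ME: annotated exons with no predicted overlap; WE: the reverse."""
    missing = sum(not any(_overlaps(e, p) for p in predicted) for e in annotated)
    wrong = sum(not any(_overlaps(p, e) for e in annotated) for p in predicted)
    return {"ME": _ratio(missing, len(annotated)),
            "WE": _ratio(wrong, len(predicted))}


if __name__ == "__main__":
    print(nucleotide_accuracy(tp=80, fp=20, tn=880, fn=20))
    print(exon_errors(predicted=[(100, 200), (500, 600)],
                      annotated=[(100, 200), (300, 400)]))
```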

23
GENSCAN: Discussion
  • Novel features of the model include
  • Use of distinct, explicit, empirically derived
    sets of model parameters to capture differences
    in gene structure and composition between
    distinct isochores of the human genome

24
GENSCAN Discussion (contd)
  • Capacity to predict multiple genes in a sequence,
    to deal with partial as well as complete genes,
    and to predict consistent genes occurring on
    either or both DNA strands
  • New statistical models of donor and acceptor
    splice sites which capture potentially important
    dependencies between signal positions

25
Gene Finder: ExonHunter
  • "ExonHunter: A comprehensive approach to gene
    finding", Brejova et al.
  • The method gathers numerous sources of information
  • Genomic sequences
  • Expressed Sequence Tags
  • Protein databases
  • All information is combined into a gene finder
    based on a hidden Markov model in a novel and
    systematic way
  • Earlier successes of GENSCAN segued into
    comparative approaches to gene finding
  • Experiments show that no single information
    source alone is sufficient to achieve the same
    performance as their combination
26
Gene Finder: ExonHunter
  • An HMM for gene finding defines a conditional
    probability distribution over all possible
    annotations (sequences of labels) of a specific
    sequence
  • The model utilizes advisors to represent
    supplementary information
  • For each position in the sequence, an advisor
    specifies a probability distribution over
    annotation labels

27
Gene Finder: ExonHunter
  • Next, all advisors are combined into a
    superadvisor
  • The superadvisor prediction at a particular
    position is a probability distribution over all
    labels, x = (x1, ..., xn), where xi is the
    probability of the ith label in the label set,
    given all advice
  • The superadvisor is finally combined with an HMM
28
ExonHunter: Distinguishing Features
  • GC content: model transition and emission
    probabilities depend on the GC content level,
    estimated from a 1000 bp window around the
    current position
  • Signal models: higher order trees (HOT)
    of order 2 are used to model acceptor and donor
    site signals
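
The windowed GC estimate can be sketched in Python as follows. The window clipping at sequence ends and the level boundaries are illustrative assumptions, not ExonHunter's actual settings.

```python
def windowed_gc(seq, position, window=1000):
    """GC fraction in a window of about `window` bp centred on `position`.

    The window is clipped at the ends of the sequence.
    """
    half = window // 2
    lo = max(0, position - half)
    hi = min(len(seq), position + half)
    region = seq[lo:hi].upper()
    if not region:
        return 0.0
    return sum(base in "GC" for base in region) / len(region)


def gc_level(gc, boundaries=(0.43, 0.51, 0.57)):
    """Map a GC fraction to a discrete level index (0 = lowest).

    The boundaries are placeholders chosen for illustration.
    """
    return sum(gc >= b for b in boundaries)


if __name__ == "__main__":
    seq = "A" * 1000 + "G" * 1000
    for pos in (0, 1000, 1999):
        gc = windowed_gc(seq, pos)
        print(pos, gc, gc_level(gc))
```

A model could then look up a separate parameter set for each level, so the probabilities in use change as the local composition changes along the sequence.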

29
ExonHunter: Distinguishing Features
  • Length distributions: divided into a head with
    an arbitrary distribution and a
    geometrically decaying tail
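
One way to read "head plus geometric tail" is sketched below in Python. The head gives explicit probabilities for lengths 1..H, and the remaining probability mass decays geometrically with ratio q for longer lengths. The parameterization and names are assumptions chosen for illustration, not ExonHunter's exact formulation.

```python
def length_pmf(head_probs, q):
    """Build a length distribution with an explicit head and geometric tail.

    head_probs[l - 1] is P(length = l) for l = 1..H.
    For l > H, P(length = l) = tail_mass * (1 - q) * q ** (l - H - 1).
    """
    head_mass = sum(head_probs)
    if not 0.0 <= head_mass <= 1.0:
        raise ValueError("head probabilities must sum to at most 1")
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    tail_mass = 1.0 - head_mass
    head_len = len(head_probs)

    def pmf(length):
        if length < 1:
            return 0.0
        if length <= head_len:
            return head_probs[length - 1]
        return tail_mass * (1.0 - q) * q ** (length - head_len - 1)

    return pmf


if __name__ == "__main__":
    pmf = length_pmf([0.1, 0.3, 0.2], q=0.5)
    print([round(pmf(length), 4) for length in range(1, 8)])
```

A geometric tail is what a single HMM state with a self-loop produces, so only the head needs the more expensive explicit-duration treatment.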

30
ExonHunter: Experimental Results

31
ExonHunter: Conclusion
  • Model based on probabilistic statements made
    by advisors, each drawing on a different source
    of information
  • A quadratic programming-based method
    extends the traditional linear combination
    approach, and the Viterbi algorithm is adapted
    to the domain
  • ExonHunter outperforms several other programs,
    such as SLAM and TWINSCAN

32
ExonHunter: A Trip to the Home Site
  • A brief demonstration of a run of ExonHunter, at
  • http://software.bioinformatics.uwaterloo.ca/exonhunter/

33
Outlook: The Future of Gene Finding
  • A shift from pattern recognition to database
    searching and information integration
  • Computational methods will still be necessary for
    other organisms
  • As tools become better, faster, and more complete,
    the questions to be asked become more interesting.
    Recall that GENSCAN started with small genomic
    contigs, whereas ExonHunter was able to combine
    much more data. Questions will therefore tend to
    be more genome-based than sequence-based

34
HGP Timeline: The National Human Genome Research
Institute, http://www.genome.gov/11007154

35
Conclusion
  • The Hidden Markov Model, HMM, is a finite set of
    states that implements transitions based on a
    probability distribution
  • Strong Mathematical grounding
  • Modular
  • Transparent
  • Might encounter local maxima
  • Slow

36
Conclusion
  • GENSCAN
  • Ab initio approach to gene finding
  • Features novel algorithm to allow flexibility
    with input sequences
  • Performs well with small genomic contigs
  • ExonHunter
  • Ab initio approach to gene finding
  • Combines multiple sources of information into HMM
  • Uses advisors to make superior decisions

37
Questions
  • Questions?