Provided by: Colin275

1
Hidden Markov Models in Bioinformatics
  • Example Domains: Gene Finding & Protein Family Modeling

2
5-Second Overview
  • Today's goal: Introduce HMMs as general tools in
    bioinformatics
  • I will use the problem of Gene Finding as an
    example of an ideal HMM problem domain
  • I will use the problem of Protein Family Modeling
    as an example of a clever way to fit HMMs to a
    problem

3
Learning Objectives
  • When I'm done, you should know:
  • When is an HMM a good fit for a problem space?
  • What materials are needed before work can begin
    with an HMM?
  • What are the advantages and disadvantages of
    using HMMs?

4
Outline
  • HMMs as Statistical Models
  • The example tasks at a glance
  • Good problems for HMMs
  • HMM Advantages
  • HMM Disadvantages
  • Gene Finding Examples

5
Statistical Models
  • Definition
  • Any mathematical construct that attempts to
    parameterize a random process
  • Example: A normal distribution
  • Assumptions
  • Parameters
  • Estimation
  • Usage
  • HMMs are just a little more complicated
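
The four items above can be sketched for the normal distribution in a few lines. This is only an illustration: the sample values are made up, and `normal_pdf` is a helper name introduced here, not something from the slides.

```python
import math
import statistics

# Assumption: observations are independent draws from one normal distribution.
samples = [4.8, 5.1, 5.0, 4.9, 5.2]  # made-up data for illustration

# Parameters and estimation: the mean and standard deviation, fit from the data.
mu = statistics.fmean(samples)
sigma = statistics.stdev(samples)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Usage: score new observations under the fitted model.
print(normal_pdf(5.0, mu, sigma) > normal_pdf(6.0, mu, sigma))
```

An HMM follows the same pattern, only with more parameters and a more involved estimation step.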

6
HMM Assumptions
  • Observations are ordered
  • Random process can be represented by a stochastic
    finite state machine with emitting states.

7
HMM Parameters
  • Using a weather example
  • Modeling daily weather for a year
  • Ra Ra Su Su Su Ra ...
  • Lots of parameters
  • One for each table entry
  • Represented in two tables.
  • One for emissions
  • One for transitions
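
A minimal sketch of what the two tables might look like for the weather example. The hidden states ("High" and "Low" pressure) and every probability below are assumptions made up for illustration; the slides only give the observed symbols Ra and Su.

```python
# Hypothetical hidden states (pressure systems) and the observed symbols.
states = ["High", "Low"]
symbols = ["Su", "Ra"]

# Probability of starting in each state (also a parameter of the model).
start = {"High": 0.5, "Low": 0.5}

# Transition table: transitions[a][b] = P(next state is b | current state is a).
transitions = {
    "High": {"High": 0.8, "Low": 0.2},
    "Low": {"High": 0.3, "Low": 0.7},
}

# Emission table: emissions[s][o] = P(observing o | current state is s).
emissions = {
    "High": {"Su": 0.9, "Ra": 0.1},
    "Low": {"Su": 0.2, "Ra": 0.8},
}

def check_rows(table):
    """Each row is a probability distribution, so it must sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9 for row in table.values())

# One parameter per table entry, as on the slide.
n_entries = (sum(len(row) for row in transitions.values())
             + sum(len(row) for row in emissions.values()))
print(check_rows(transitions), check_rows(emissions), n_entries)
```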

8
HMM Estimation
  • Estimation is called training; it falls under machine learning
  • Feed an architecture (given in advance) a set of
    observation sequences
  • The training process will iteratively alter its
    parameters to fit the training set
  • The trained model will assign the training
    sequences high probability
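
The slides do not name the training algorithm. A common choice for this kind of iterative fitting is Baum-Welch (an expectation-maximization procedure), and the sketch below assumes that choice. It handles a single short sequence and skips the numerical scaling a real implementation would need; the state names, starting guess, and observations are all made up.

```python
def forward(obs, states, start, trans, emit):
    """alpha[t][s]: probability of obs[:t+1] with the path ending in state s."""
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: emit[s][o] * sum(prev[r] * trans[r][s] for r in states)
                      for s in states})
    return alpha

def backward(obs, states, trans, emit):
    """beta[t][s]: probability of obs[t+1:] given state s at time t."""
    beta = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][r] * emit[r][o] * nxt[r] for r in states)
                        for s in states})
    return beta

def baum_welch_step(obs, states, symbols, start, trans, emit):
    """One re-estimation pass; returns new tables and the old likelihood."""
    alpha = forward(obs, states, start, trans, emit)
    beta = backward(obs, states, trans, emit)
    likelihood = sum(alpha[-1].values())
    T = len(obs)
    # gamma[t][s]: posterior probability of being in state s at time t.
    gamma = [{s: alpha[t][s] * beta[t][s] / likelihood for s in states}
             for t in range(T)]
    new_trans, new_emit = {}, {}
    for s in states:
        visits = sum(gamma[t][s] for t in range(T - 1))
        new_trans[s] = {
            r: sum(alpha[t][s] * trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r]
                   for t in range(T - 1)) / (likelihood * visits)
            for r in states}
        total = sum(gamma[t][s] for t in range(T))
        new_emit[s] = {k: sum(gamma[t][s] for t in range(T) if obs[t] == k) / total
                       for k in symbols}
    return dict(gamma[0]), new_trans, new_emit, likelihood

# Made-up observations and a deliberately rough starting guess.
obs = "Ra Ra Su Su Su Ra Ra Ra Su Su Su Su".split()
states, symbols = ["A", "B"], ["Su", "Ra"]
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.5, "B": 0.5}}
emit = {"A": {"Su": 0.7, "Ra": 0.3}, "B": {"Su": 0.4, "Ra": 0.6}}

likelihoods = []
for _ in range(10):
    start, trans, emit, likelihood = baum_welch_step(
        obs, states, symbols, start, trans, emit)
    likelihoods.append(likelihood)
print(likelihoods[0], likelihoods[-1])
```

Each pass should leave the probability of the training sequence no lower than before, which is the sense in which the trained model assigns the training data high probability.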

9
HMM Usage
  • Two major tasks
  • Evaluate the probability of an observation
    sequence given the model (Forward)
  • Find the most likely path through the model for a
    given observation sequence (Viterbi)
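
Both tasks can be sketched in a few lines each. The weather-style model below (hidden states "High" and "Low", all probabilities) is a made-up assumption, and the function names are introduced here for illustration. The sketch multiplies raw probabilities, which is fine for short sequences; longer ones would normally use log probabilities or scaling.

```python
def forward_probability(obs, states, start, trans, emit):
    """Forward: total probability of obs, summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

def viterbi_path(obs, states, start, trans, emit):
    """Viterbi: the single most likely state path and its probability."""
    best = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for o in obs[1:]:
        pointers, nxt = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] * trans[r][s])
            pointers[s] = prev
            nxt[s] = best[prev] * trans[prev][s] * emit[s][o]
        back.append(pointers)
        best = nxt
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Hypothetical model: hidden pressure systems emitting sun (Su) or rain (Ra).
states = ["High", "Low"]
start = {"High": 0.5, "Low": 0.5}
trans = {"High": {"High": 0.8, "Low": 0.2}, "Low": {"High": 0.3, "Low": 0.7}}
emit = {"High": {"Su": 0.9, "Ra": 0.1}, "Low": {"Su": 0.2, "Ra": 0.8}}

obs = ["Su", "Su", "Ra", "Ra"]
total = forward_probability(obs, states, start, trans, emit)
path, p_best = viterbi_path(obs, states, start, trans, emit)
print(total, path, p_best)
```

Forward gives a score for the whole sequence; Viterbi gives a label for each observation.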

10
Gene Finding (An Ideal HMM Domain)
  • Our Objective
  • To find the coding and non-coding regions of an
    unlabeled string of DNA nucleotides
  • Our Motivation
  • Assist in the annotation of genomic data produced
    by genome sequencing methods
  • Gain insight into the mechanisms involved in
    transcription, splicing and other processes

11
Gene Finding Terminology
  • A string of DNA nucleotides containing a gene
    will have separate regions (lines)
  • Introns: non-coding regions within a gene
  • Exons: coding regions
  • Separated by functional sites (boxes)
  • Start and stop codons
  • Splice sites: acceptors and donors

12
Gene Finding Challenges
  • Need the correct reading frame
  • Introns can interrupt an exon in mid-codon
  • There is no hard and fast rule for identifying
    donor and acceptor splice sites
  • Signals are very weak
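
To see why the reading frame matters, here is a small sketch that splits one made-up DNA string into codons in each of the three forward frames. The sequence and the helper name `codons` are assumptions for illustration; TAA, TAG, and TGA are the standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons(dna, frame):
    """Split dna into complete codons starting at offset frame (0, 1, or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]

dna = "ATGGCCTGAAAC"  # made-up sequence
stops_by_frame = {}
for frame in range(3):
    cs = codons(dna, frame)
    stops_by_frame[frame] = [c for c in cs if c in STOP_CODONS]
    print(frame, cs, stops_by_frame[frame])
```

The same bases read as a start codon followed by a stop codon in one frame and as unremarkable codons in the other two, so a gene finder has to track which frame it is in, including across introns.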

13
Protein Family Modeling (A clever fit of HMMs)
  • I have a protein sequence.
  • What family is it in?
  • Can you give me a quick alignment to the other
    members of the family?
  • These amino acids here, do they match the
    family's consensus positions, or are they inserts?

14
Profile HMM
  • Square: Match (consensus) state
  • Diamond: Insert state (notice the loops)
  • Circle: Delete state, which allows you to jump over a
    match
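
A sketch of the wiring such a model might have. The slide only describes the three state types, so the exact set of edges below follows the commonly drawn profile HMM layout and should be read as an assumption; the state names (M1, I1, D1, ...) and the function name are introduced here for illustration.

```python
def profile_hmm_edges(length):
    """Allowed transitions for a profile HMM with `length` match positions."""
    edges = {"Begin": ["M1", "I0", "D1"], "I0": ["I0", "M1", "D1"]}
    for k in range(1, length + 1):
        # After position k you move on to position k + 1, or to End.
        if k < length:
            onward = ["M%d" % (k + 1), "D%d" % (k + 1)]
        else:
            onward = ["End"]
        edges["M%d" % k] = onward + ["I%d" % k]
        edges["I%d" % k] = onward + ["I%d" % k]  # self-loop: any number of inserts
        edges["D%d" % k] = onward + ["I%d" % k]  # emits nothing: skips match k
    return edges

edges = profile_hmm_edges(3)
for state, targets in edges.items():
    print(state, "->", targets)
```

Match and insert states emit amino acids; delete states emit nothing, which is what lets a path jump over a consensus position.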

15
What makes a good HMM problem space?
  • Characteristics
  • Classification problems
  • There are two main types of output from an HMM
  • Scoring of sequences
  • (Protein family modeling)
  • Labeling of observations within a sequence
  • (Gene Finding)

16
HMM Problem Characteristics Continued
  • The observations in a sequence should have a
    clear and meaningful order
  • Unordered observations will not map easily to
    states
  • It's beneficial, but not necessary, for the
    observations to follow some sort of grammar
  • Makes it easier to design an architecture
  • Gene Finding
  • Protein Family Modeling

17
HMM Requirements
  • So you've decided you want to build an HMM;
  • here's what you need:
  • An architecture
  • Probably the hardest part
  • Should be biologically sound and easy to interpret
  • A well-defined success measure
  • Necessary for any form of machine learning

18
HMM Requirements Continued
  • Training data
  • Labeled or unlabeled: it depends
  • You do not always need a labeled training set to
    do observation labeling, but it helps
  • As a rule of thumb, the amount of training data needed is:
  • Directly proportional to the number of free
    parameters in the model
  • Inversely proportional to the size of the
    training sequences
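
One way to count free parameters, sketched under the assumption that the model is stored as tables of probabilities. The two-state DNA model below and the function name are made up for illustration.

```python
def free_parameters(*tables):
    """Count free parameters: a row with k allowed entries has k - 1 of them,
    because the row must sum to 1."""
    total = 0
    for table in tables:
        for row in table.values():
            allowed = sum(1 for p in row.values() if p > 0)
            total += max(allowed - 1, 0)
    return total

# Hypothetical two-state model over the four DNA bases.
transitions = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.05, "noncoding": 0.95},
}
emissions = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

n_free = free_parameters(transitions, emissions)
print(n_free)
```

Entries fixed at zero by the architecture do not count, which is one reason a constrained architecture needs less training data than a fully connected one.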

19
Why HMMs might be a good fit for Gene Finding
  • Classification: Classifying observations within a
    sequence
  • Order: A DNA sequence is a set of ordered
    observations
  • Grammar / Architecture: Our grammatical structure
    (and the beginnings of our architecture) is right
    here
  • Success measure: percentage of complete exons correctly
    labeled
  • Training data: Available from various genome
    annotation projects

20
Why HMMs can be made to fit Protein Family Modeling
  • Classification: What model fits a sequence best?
  • Order: An amino acid sequence is well ordered
  • Grammar: Any two matches can be separated by a
    series of inserts and deletes (okay, maybe the
    word grammar is a bit of a stretch)
  • Success Measure: How many sequences can we
    correctly label after training?

21
HMM Advantages
  • Statistical Grounding
  • Statisticians are comfortable with the theory
    behind hidden Markov models
  • Freedom to manipulate the training and
    verification processes
  • Mathematical / theoretical analysis of the
    results and processes
  • HMMs are still very powerful modeling tools, far
    more powerful than many statistical methods

22
HMM Advantages continued
  • Modularity
  • HMMs can be combined into larger HMMs
  • Transparency of the Model
  • Assuming an architecture with a good design
  • People can read the model and make sense of it
  • The model itself can help increase understanding

23
HMM Advantages continued
  • Incorporation of Prior Knowledge
  • Incorporate prior knowledge into the architecture
  • Initialize the model close to something believed
    to be correct
  • Use prior knowledge to constrain training process

24
How does Gene Finding make use of HMM advantages?
  • Statistics
  • Many systems alter the training process to better
    suit their success measure
  • Modularity
  • Almost all systems use a combination of models,
    each individually trained for each gene region
  • Prior Knowledge
  • A fair amount of prior biological knowledge is
    built into each architecture

25
HMM Disadvantages
  • Markov Chains
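  • States are supposed to be independent: what a state
    emits depends only on that state
  • Given the state path, the emission at one position, P(y),
    must be independent of the emission at another, P(x),
    and vice versa
  • This usually isn't true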
  • Can get around it when relationships are local
  • Not good for RNA folding problems

[Figure labels: P(x), P(y)]

26
HMM Disadvantages continued
  • ...and then there are the standard machine learning
    problems
  • Watch out for local maxima
  • Model may not converge to a truly optimal
    parameter set for a given training set
  • Avoid over-fitting
  • You're only as good as your training set
  • More training is not always good

27
HMM Disadvantages continued
  • Speed!!!
  • Almost everything one does in an HMM involves
    enumerating all possible paths through the
    model
  • There are efficient ways to do this (dynamic programming,
    as in Forward and Viterbi)
  • Still slow in comparison to other methods
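
To get a feel for the cost, the sketch below literally enumerates every path for a tiny made-up weather model and compares the number of paths with the rough step count of a dynamic-programming pass. The model, its probabilities, and the function name are assumptions for illustration.

```python
from itertools import product

def brute_force_probability(obs, states, start, trans, emit):
    """Sum the probability of obs over every possible state path."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

# Hypothetical two-state weather model.
states = ["High", "Low"]
start = {"High": 0.5, "Low": 0.5}
trans = {"High": {"High": 0.8, "Low": 0.2}, "Low": {"High": 0.3, "Low": 0.7}}
emit = {"High": {"Su": 0.9, "Ra": 0.1}, "Low": {"Su": 0.2, "Ra": 0.8}}

obs = ["Su", "Su", "Ra", "Ra"]
total = brute_force_probability(obs, states, start, trans, emit)
n_paths = len(states) ** len(obs)

# For a year of daily observations, compare path count with the
# states-squared-times-length step count of dynamic programming.
year_paths = len(states) ** 365
dp_steps = len(states) ** 2 * 365
print(total, n_paths, len(str(year_paths)), dp_steps)
```

Enumeration grows exponentially with sequence length, while the dynamic-programming cost grows linearly in length and quadratically in the number of states. That quadratic factor is what still hurts for large models.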

28
HMM Gene Finders: VEIL
  • A straight HMM Gene Finder
  • Takes advantage of grammatical structure and
    modular design
  • Uses many states that can only emit one symbol to
    get around state independence
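
The idea behind single-symbol states can be illustrated with stop codons. The sketch below is not VEIL's actual model: the state names, the layout, and the transition probabilities are made up. Each state emits exactly one base, so the dependence between neighboring bases lives entirely in which transitions exist.

```python
# Each state emits exactly one fixed symbol.
emits = {"T": "T", "A1": "A", "G1": "G", "A2": "A", "G2": "G"}

# Hypothetical transition probabilities for a stop-codon submodel.
trans = {
    "start": {"T": 1.0},
    "T": {"A1": 0.5, "G1": 0.5},
    "A1": {"A2": 0.5, "G2": 0.5},  # TAA or TAG
    "G1": {"A2": 1.0},             # TGA only; TGG has no path
}

def sequence_probability(seq):
    """Probability that the submodel emits seq, starting from 'start'."""
    current = {"start": 1.0}
    for symbol in seq:
        nxt = {}
        for state, p in current.items():
            for target, q in trans.get(state, {}).items():
                if emits[target] == symbol:
                    nxt[target] = nxt.get(target, 0.0) + p * q
        current = nxt
    return sum(current.values())

for codon in ["TAA", "TAG", "TGA", "TGG"]:
    print(codon, sequence_probability(codon))
```

A single state with independent emissions could not allow TAG and TGA while forbidding TGG; a small chain of one-symbol states can.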

29
HMM Gene Finders: HMMGene
  • Uses an extended HMM called a CHMM
  • CHMM: HMM with classes
  • Takes full advantage of being able to modify the
    statistical algorithms
  • Uses high-order states
  • Trains everything at once

30
HMM Gene Finders: Genie
  • Uses a generalized HMM (GHMM)
  • Edges in model are complete HMMs
  • States can be any arbitrary program
  • States are actually neural networks specially
    designed for signal finding

31
Conclusions
  • HMMs have problems where they excel, and problems
    where they do not
  • You should consider using one if
  • Problem can be phrased as classification
  • Observations are ordered
  • The observations follow some sort of grammatical
    structure (optional)

32
Conclusions
  • Advantages
  • Statistics
  • Modularity
  • Transparency
  • Prior Knowledge
  • Disadvantages
  • State independence
  • Over-fitting
  • Local maxima
  • Speed

33
Some final words
  • Lots of problems can be phrased as classification
    problems
  • Homology search
  • Build a model of the sequence with a few close
    homologs, and use the model to search for more
    distant homologs
  • Sequence alignment
  • Align all of these sequences to the model that
    represents their family

34
Questions
  • Any Questions?