
Welcome to Introduction to Computational Genomics for Infectious Disease

Course Instructors

- Instructor
- James Galagan
- Teaching Assistants
- Lab Instructors

Brian Weiner, Desmond Lun, Antonis Rokas, Mark Borowsky, Jeremy Zucker, Reinhard Engels, Aaron Brandes, Caroline Colijn

Other members of Broad Microbial Analysis Group

Schedule and Logistics

- Lectures: Tues/Thurs 11-12:30, Harvard School of Public Health, FXB-301 (The François-Xavier Bagnoud Center, Room 301)
- Labs: Wed/Fri 1-3, Broad Institute, Olympus Room (first floor of Broad Main Lobby; see front desk attendant near entrance)
- Individual computers and software provided
- No programming experience required

Website

www.broad.mit.edu/annotation/winter_course_2006/

- Contact information
- Directions to Broad
- Lecture slides
- Lab handouts
- Resources

Goals of Course

- Introduction to concepts behind commonly used computational tools
- Recognize connections between different concepts and applications
- Hands-on experience with computational analysis

Concepts and Applications

- Lectures will cover concepts
- Computationally oriented
- Labs will provide opportunity for hands-on application of tools
- Nuts and bolts of running tools
- Application of tools not covered in lectures

Computational Genomics Overview

Slide Credit Manolis Kellis

Topics

- Probabilistic Sequence Modeling
- Clustering and Classification
- Motifs
- Steady State Metabolic Modeling

Topics Not Covered

- Sequence Alignment
- Phylogeny (maybe in labs)
- Molecular Evolution
- Population Genetics
- Advanced Machine Learning
- Bayesian Networks
- Conditional Random Fields

Applications to Infectious Disease

- Examples and labs will focus on the analysis of microbial genomics data
- Pathogenicity islands
- TB expression analysis
- Antigen prediction
- Mycolic acid metabolism
- But approaches are applicable to any organism and to many different questions

Probabilistic Modeling of Biological Sequences

- Concepts
- Statistical Modeling of Sequences
- Hidden Markov Models
- Applications
- Predicting pathogenicity islands
- Modeling protein families
- Lab Practical
- Basic sequence annotation

Probabilistic Sequence Modeling

- Treat objects of interest as random variables
- Nucleotides, amino acids, genes, etc.
- Model probability distributions for these variables
- Use probability calculus to make inferences

Why Probabilistic Sequence Modeling?

- Biological data is noisy
- Probability provides a calculus for manipulating models
- Not limited to yes/no answers: can provide degrees of belief
- Many common computational tools are based on probabilistic models

Sequence Annotation

GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATG

AAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAA

TGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCA

GGGCCGAGTATAATTGGTCGCGGTCGCCTCGAAGTTAGCTTATGCAATGC

AGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGCGTAGGTGA

AGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTT

CC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGG

TCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGT

AAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGA

CCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGC

CGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACC

ACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGG

CTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCT

GACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGC

CGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCG

CCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTAC

CGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAA

T GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTG

GGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATC

TGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACT

GGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGC

GAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCG

AGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGG

CGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCG

CGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCT

GCACGAGGCCGAACTCCAACTCGCCGAG

Sequence Annotation

(Figure: the same sequence as above, with annotated features highlighted: Promoter Motif, Gene, Kinase Domain)

Probabilistic Sequence Modeling

- Hidden Markov Models (HMM)
- A general framework for sequences of symbols (e.g. nucleotides, amino acids)
- Widely used in computational genomics
- HMMER: HMMs for protein families
- Pathogenicity Islands

Pathogenicity Islands

- Clusters of genes acquired by horizontal transfer
- Present in pathogenic species but not others
- Frequently encode virulence factors
- Toxins, secondary metabolites, adhesins
- (Flanked by repeats, gene content, phylogeny, regulation, codon usage)
- Different GC content than rest of genome

Application: Bacillus subtilis

Modeling Sequence Composition

- Calculate sequence distribution from known islands
- Count occurrences of A, T, G, C
- Model islands as nucleotides drawn independently from this distribution

P(S_i | M_P)

The Probability of a Sequence

- Can calculate the probability of a particular sequence (S) according to the pathogenicity island model (M_P)

Example

S = AAATGCGCATTTCGAA
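
A minimal sketch of this calculation, assuming nucleotides are drawn independently so that P(S | M_P) is the product of the per-nucleotide probabilities P(S_i | M_P). The island frequencies in `ISLAND_FREQS` are hypothetical placeholders chosen for illustration; they are not values from the course.

```python
import math

# Hypothetical island nucleotide frequencies (illustrative, not from the slides).
ISLAND_FREQS = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

def sequence_probability(seq, freqs):
    """P(S | M): product of independent per-nucleotide probabilities."""
    return math.prod(freqs[base] for base in seq)

def sequence_log_probability(seq, freqs):
    """log P(S | M); summing logs avoids underflow on long sequences."""
    return sum(math.log(freqs[base]) for base in seq)

S = "AAATGCGCATTTCGAA"
p = sequence_probability(S, ISLAND_FREQS)
log_p = sequence_log_probability(S, ISLAND_FREQS)
print(f"P(S | M_P) = {p:.3e}, log P(S | M_P) = {log_p:.3f}")
```

In practice the log form is the one to use, since the raw product shrinks toward zero very quickly as the sequence gets longer.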

Sequence Classification

- PROBLEM: Given a sequence, is it an island?
- We can calculate P(S | M_P), but what is a sufficient P value?
- SOLUTION: compare to a null model and calculate log-likelihood ratio
- e.g. background DNA distribution model, B

Pathogenicity Islands

Background DNA
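
The log-likelihood ratio can be sketched as follows, assuming both the island model and the background model treat nucleotides as independent draws. Both frequency tables are hypothetical, and the decision threshold of zero is an assumption made for illustration.

```python
import math

# Hypothetical nucleotide frequencies (illustrative, not from the slides).
ISLAND_FREQS = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
BACKGROUND_FREQS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_likelihood_ratio(seq, model, null):
    """log [P(S | model) / P(S | null)] under independent positions."""
    return sum(math.log(model[base] / null[base]) for base in seq)

S = "AAATGCGCATTTCGAA"
score = log_likelihood_ratio(S, ISLAND_FREQS, BACKGROUND_FREQS)
# A positive score favors the island model, a negative score the background.
label = "island-like" if score > 0 else "background-like"
print(f"score = {score:.3f} ({label})")
```

Scoring against a null model removes the dependence on sequence length that makes a raw P(S | M_P) hard to interpret.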

Finding Islands in Sequences

- Could use the log-likelihood ratio on windows of fixed size
- What if islands have variable length?
- We prefer a model for entire sequence

TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG

GCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA

TATTGGC

A More Complex Model

(Figure: two states, Background and Island, placed over the sequence shown above)

A Generative Model

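(Figure: a chain of hidden labels, each P or B, emitting the observed sequence S)

Emission probabilities: P(S | P), P(S | B)

Transition probabilities P(L_{i+1} | L_i):

       B_{i+1}   P_{i+1}
B_i    0.85      0.15
P_i    0.25      0.75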
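
To make "generative" concrete, here is a small sketch that samples a label path and a sequence from the two-state model. The transition table is the one on the slide; the emission frequencies, the start state, and the function name `sample_hmm` are assumptions introduced for illustration.

```python
import random

# Transition probabilities from the slide: P(L_{i+1} | L_i).
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
# Hypothetical emission frequencies (not from the slides).
EMIT = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "P": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def draw(rng, table):
    """Draw one key from a {outcome: probability} table."""
    return rng.choices(list(table), weights=list(table.values()))[0]

def sample_hmm(n, start="B", seed=0):
    """Sample n labels and n nucleotides; starting in B is an assumption."""
    rng = random.Random(seed)
    labels, bases = [], []
    state = start
    for _ in range(n):
        labels.append(state)
        bases.append(draw(rng, EMIT[state]))   # emit from the current state
        state = draw(rng, TRANS[state])        # then move to the next state
    return "".join(labels), "".join(bases)

labels, seq = sample_hmm(50)
print(labels)
print(seq)
```

The printed labels are exactly what is missing from real data: only the second line would be observed.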

A Hidden Markov Model

- Hidden states: L = 1, ..., K
- Transition probabilities: a_ij = transition probability from state i to state j
- Emission probabilities: e_i(b) = P(emitting b | state = i)
- Initial state probability: p(b) = P(first state = b)

(Figure: State i and State j connected by transition probabilities, each with its emission probabilities e_i(b) and e_j(b))

What can we do with this model?

- The model defines a joint probability over labels and sequences, P(L, S)
- Implicit in model is what labels tend to go with what sequences (and vice versa)
- Rules of probability allow us to use this model to analyze existing sequences

Fundamental HMM Operations

Computation

- Decoding
- Given an HMM and sequence S
- Find a corresponding sequence of labels, L
- Evaluation
- Given an HMM and sequence S
- Find P(S | HMM)
- Training
- Given an HMM w/o parameters and set of sequences S
- Find transition and emission probabilities that maximize P(S | params, HMM)

Biology

- Decoding: Annotate pathogenicity islands on a new sequence
- Evaluation: Score a particular sequence (not as useful for this model; will come back to this later)
- Training: Learn a model for sequence composed of background DNA and pathogenicity islands

The Hidden in HMM

- DNA does not come conveniently labeled (e.g. Island, Gene, Promoter)
- We observe nucleotide sequences
- The hidden in HMM refers to the fact that state labels, L, are not observed
- Only observe emissions (e.g. nucleotide sequence in our example)

(Figure: hidden states State i and State j emitting the observed sequence A A G T T A G A G)

Decoding With HMM

- Given observables, we would like to predict a sequence of hidden states that is most likely to have generated that sequence

Pathogenicity Island Example

Given a nucleotide sequence, we want a labeling of each nucleotide as either pathogenicity island or background DNA

The Most Likely Path

- Given a sequence, one reasonable choice for a labeling is:

The sequence of labels, L, (or path) that makes the labels and sequence most likely given the model
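
Written as a formula, this choice of labeling is the path that maximizes the joint probability. The symbol L* for the chosen path is notation introduced here, not taken from the slides.

```latex
L^{*} = \arg\max_{L} P(L, S \mid \mathrm{HMM})
```

Because P(L | S, HMM) = P(L, S | HMM) / P(S | HMM) and the denominator does not depend on L, the same path also maximizes P(L | S, HMM).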

Probability of a Path, Seq

(Figure: at each position of the sequence S = G C A A A T G C, the label L can be either P or B)

We could try to calculate the probability of every path, but...
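
A sketch of the joint probability of one path and the sequence, which also shows why enumerating every path does not scale. The transition table is the one given earlier in the lecture; the initial distribution `START` and the emission table `EMIT` are hypothetical.

```python
import math

START = {"B": 0.5, "P": 0.5}  # assumed initial state distribution
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {  # hypothetical emission frequencies
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "P": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def joint_log_prob(labels, seq):
    """log P(L, S): start term, then one transition and one emission per step."""
    total = math.log(START[labels[0]]) + math.log(EMIT[labels[0]][seq[0]])
    for prev, cur, base in zip(labels, labels[1:], seq[1:]):
        total += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][base])
    return total

seq = "GCAAATGC"
all_background = joint_log_prob("B" * len(seq), seq)
with_island = joint_log_prob("BBPPPPBB", seq)
n_paths = 2 ** len(seq)  # two labels per position
print(all_background, with_island, n_paths)
```

With two labels per position the number of paths doubles with every added nucleotide, so a sequence of a few hundred bases already has far too many paths to score one by one.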

Decoding

- Viterbi Algorithm
- Finds most likely sequence of labels, L, given sequence and model
- Uses dynamic programming (same technique used in sequence alignment)
- Much more efficient than searching every path
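
A compact sketch of Viterbi decoding for the two-state island model, working in log space. The initial distribution and the emission frequencies are hypothetical, and the helper `joint_log_prob` is included only so the result can be checked against a direct calculation.

```python
import math

STATES = "BP"
START = {"B": 0.5, "P": 0.5}  # assumed initial state distribution
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {  # hypothetical emission frequencies
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "P": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def joint_log_prob(labels, seq):
    """log P(L, S) for one fixed path, used to check the Viterbi score."""
    total = math.log(START[labels[0]]) + math.log(EMIT[labels[0]][seq[0]])
    for prev, cur, base in zip(labels, labels[1:], seq[1:]):
        total += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][base])
    return total

def viterbi(seq):
    """Return (most likely path, log P(path, seq))."""
    # v[k]: log-probability of the best path that ends in state k so far.
    v = {k: math.log(START[k]) + math.log(EMIT[k][seq[0]]) for k in STATES}
    backpointers = []
    for base in seq[1:]:
        ptr, nxt = {}, {}
        for j in STATES:
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][j]))
            ptr[j] = best
            nxt[j] = v[best] + math.log(TRANS[best][j]) + math.log(EMIT[j][base])
        backpointers.append(ptr)
        v = nxt
    # Trace back from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]

seq = "GCAATTAATTAATTGC"
path, score = viterbi(seq)
print(seq)
print(path, round(score, 3))
```

The work grows as (sequence length) x (number of states squared), compared with 2^N paths for exhaustive search over two states.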

Probability of a Single Label

(Figure: at each position of the sequence S = G C A A A T G C, the label L can be either P or B)

P(Label_5 = B | S)

Forward algorithm (dynamic programming)

- Calculate most probable label, L_i, at each position i
- Do this for all N positions; gives us L_1, L_2, L_3, ..., L_N

Two Decoding Options

- Viterbi Algorithm
- Finds most likely sequence of labels, L, given sequence and model
- Posterior Decoding
- Finds most likely label at each position for all positions, given sequence and model
- L_1, L_2, L_3, ..., L_N
- Forward and Backward equations
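
Posterior decoding can be sketched with the forward and backward recursions as below. This version works with raw probabilities and no scaling, so it is only suitable for short sequences; the initial distribution and the emission frequencies are hypothetical.

```python
STATES = "BP"
START = {"B": 0.5, "P": 0.5}  # assumed initial state distribution
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {  # hypothetical emission frequencies
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "P": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def forward(seq):
    """f[i][k] = P(x_1..x_i, state_i = k)."""
    f = [{k: START[k] * EMIT[k][seq[0]] for k in STATES}]
    for base in seq[1:]:
        prev = f[-1]
        f.append({j: EMIT[j][base] * sum(prev[k] * TRANS[k][j] for k in STATES)
                  for j in STATES})
    return f

def backward(seq):
    """b[i][k] = P(x_{i+1}..x_N | state_i = k)."""
    b = [{k: 1.0 for k in STATES}]
    for base in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(TRANS[k][j] * EMIT[j][base] * nxt[j] for j in STATES)
                     for k in STATES})
    return b

def posterior_decode(seq):
    """Return the per-position most probable labels and the posteriors."""
    f, b = forward(seq), backward(seq)
    p_seq = sum(f[-1].values())  # P(S | HMM)
    posteriors = [{k: f[i][k] * b[i][k] / p_seq for k in STATES}
                  for i in range(len(seq))]
    labels = "".join(max(STATES, key=p.get) for p in posteriors)
    return labels, posteriors

seq = "GCAATTAATTAATTGC"
labels, posteriors = posterior_decode(seq)
print(seq)
print(labels)
```

Unlike the Viterbi path, the posterior labels are chosen one position at a time, so they also come with a per-position confidence, P(L_i = k | S).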

Application: Bacillus subtilis

Method

- Second Order Emissions: P(S_i) = P(S_i | State, S_{i-1}, S_{i-2}) (capturing trinucleotide frequencies)
- Train using EM
- Predict w/ Posterior Decoding

Three State Model

- Gene+
- Gene-
- AT Rich

Nicolas et al (2002) NAR

Results

Gene on positive strand

Gene on negative strand

- A/T Rich
- Intergenic regions
- Islands

Each line is P(label | S, model), color coded by label

Nicolas et al (2002) NAR


Training an HMM

- Transition probabilities, e.g. P(P_{i+1} | B_i): the probability of entering a pathogenicity island from background DNA
- Emission probabilities, i.e. the nucleotide frequencies for background DNA and pathogenicity islands

(Figure: states B and P with transition probabilities P(L_{i+1} | L_i) and emission probabilities P(S | P), P(S | B))

Learning From Labelled Data

Maximum Likelihood Estimation

If we have a sequence that has islands marked, we can simply count

(Figure: a labeled path L through states P and B over the sequence S = G C A A A T G C)

Emission probabilities, P(S | P) and P(S | B), e.g.:

A      T       G      C
1/5    0 (!)   2/5    2/5

etc.

Transition probabilities P(L_{i+1} | L_i):

        B_{i+1}   P_{i+1}   End
B_i     3/5       1/5       1/5
P_i     1/3       2/3       0
Start   1         0         0
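
Counting from labeled data can be sketched as below. The labeling `B B B P P P B B` is an assumption: it is not legible in the transcript, but it is consistent with the transition counts on the slide and with the emission row shown, which then corresponds to state B. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def normalize(counts):
    """Turn {(source, outcome): count} into {source: {outcome: fraction}}."""
    totals = Counter()
    for (source, _), n in counts.items():
        totals[source] += n
    table = {}
    for (source, outcome), n in counts.items():
        table.setdefault(source, {})[outcome] = Fraction(n, totals[source])
    return table

def estimate(labels, seq):
    """Maximum likelihood estimates obtained by counting labeled data."""
    path = ["Start"] + list(labels) + ["End"]
    trans = normalize(Counter(zip(path, path[1:])))  # label -> next label
    emit = normalize(Counter(zip(labels, seq)))      # label -> nucleotide
    return trans, emit

seq = "GCAAATGC"
labels = "BBBPPPBB"  # assumed labeling, consistent with the slide's counts
trans, emit = estimate(labels, seq)
print(trans["B"], trans["P"])
print(emit["B"], emit["P"])
```

Outcomes that never occur, such as T in state B, are simply absent from the table, which is the zero estimate on the slide; with this little data such zeros are an artifact of the sample size.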

Unlabelled Data

How do we know how to count?

(Figure: unknown labels L, from start to End, over the sequence S = G C A A A T G C; the emission tables P(S | P), P(S | B) and the transition table P(L_{i+1} | L_i) are now unknown)

Unlabeled Data

(Figure: unknown labels L, from start to End, over the sequence S = G C A A A T G C)

- An idea
- Imagine we start with some parameters
- We could calculate the most likely path, P, given those parameters and S
- We could then use P to update our parameters by maximum likelihood
- And iterate (to convergence)

Expectation Maximization (EM)

- Initialize parameters
- E Step: Estimate probability of hidden labels, Q, given parameters and sequence
- M Step: Choose new parameters to maximize expected likelihood of parameters given Q
- Iterate

P(S | Model) is guaranteed not to decrease at any iteration
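
For HMMs, EM is usually called the Baum-Welch algorithm. The sketch below runs it on a single short sequence without numerical scaling, so it is illustrative only; the starting parameters, the toy sequence, and the function names are all assumptions.

```python
import math

STATES = "BP"
ALPHABET = "ACGT"

def forward(seq, start, trans, emit):
    f = [{k: start[k] * emit[k][seq[0]] for k in STATES}]
    for base in seq[1:]:
        prev = f[-1]
        f.append({j: emit[j][base] * sum(prev[k] * trans[k][j] for k in STATES)
                  for j in STATES})
    return f

def backward(seq, trans, emit):
    b = [{k: 1.0 for k in STATES}]
    for base in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][j] * emit[j][base] * nxt[j] for j in STATES)
                     for k in STATES})
    return b

def em_step(seq, start, trans, emit):
    """One EM iteration; returns new parameters and log P(S | old parameters)."""
    f, b = forward(seq, start, trans, emit), backward(seq, trans, emit)
    p_seq = sum(f[-1].values())
    n = len(seq)
    # E step: posterior label probabilities and expected transition counts.
    gamma = [{k: f[i][k] * b[i][k] / p_seq for k in STATES} for i in range(n)]
    xi = {k: {j: sum(f[i][k] * trans[k][j] * emit[j][seq[i + 1]] * b[i + 1][j]
                     for i in range(n - 1)) / p_seq
              for j in STATES} for k in STATES}
    # M step: re-estimate parameters from the expected counts.
    new_start = dict(gamma[0])
    new_trans = {k: {j: xi[k][j] / sum(xi[k].values()) for j in STATES}
                 for k in STATES}
    new_emit = {k: {c: sum(g[k] for g, base in zip(gamma, seq) if base == c)
                       / sum(g[k] for g in gamma)
                    for c in ALPHABET} for k in STATES}
    return new_start, new_trans, new_emit, math.log(p_seq)

seq = "GCGCGGCCATATTAATATTAGCCGGCGCAATTATATTAAT"  # toy data
start = {"B": 0.5, "P": 0.5}
trans = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
emit = {"B": {c: 0.25 for c in ALPHABET},
        "P": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
log_likelihoods = []
for _ in range(20):
    start, trans, emit, ll = em_step(seq, start, trans, emit)
    log_likelihoods.append(ll)
print([round(x, 3) for x in log_likelihoods])
```

The recorded log-likelihoods never go down from one iteration to the next, but EM only finds a local optimum, so the result depends on the starting parameters.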

Expectation Maximization (EM)

- Remember the basic idea!
- Use model to estimate (distribution of) missing data
- Use estimate to update model
- Repeat until convergence
- EM is a general approach for learning models (ML estimation) when there is missing data
- Widely used in computational biology
- EM frequently used in motif discovery
- Lecture 3

A More Sophisticated Application

Modeling Protein Families

- Given amino acid sequences from a protein family, how can we find other members?
- Can search databases with each known member: not sensitive
- More information is contained in full set
- The HMM Profile Approach
- Learn the statistical features of protein family
- Model these features with an HMM
- Search for new members by scoring with HMM

We will learn features from multiple alignments

Human Ubiquitin Conjugating Enzymes

- UBE2D2   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
- UBE2D3   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
- BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
- UBE2D1   FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
- UBE2E1   FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
- UBCH9    FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
- UBE2N    LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
- AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
- UBCH10   FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
- CDC34    FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
- BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
- UBE2G1   FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
- UBE2B    FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
- UBE2I    FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
- E2EPF5   LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
- UBE2L1   FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
- UBE2L6   FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
- UBE2H    LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
- UBC12    VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

Profile HMM

Using Profile HMMs

Computation

- Decoding
- Find sequence of labels, L, that maximizes P(L | S, HMM)
- Evaluation
- Find P(S | HMM)
- Training
- Find transition and emission probabilities that maximize P(S | params, HMM)

Biology

- Decoding: Align a new sequence to a protein family
- Evaluation: Score a sequence for membership in family
- Training: Discover and model family structure

Example: Modeling Globins

- Profile HMM from 300 randomly selected globin genes
- Score database of 60,000 proteins

PFAM: Collection of Profile HMMs

http://www.sanger.ac.uk/Software/Pfam/

PFAM Resources

- 8957 curated protein families and domains
- Each with HMM profile(s)
- Coverage
- 73% of proteins in Swissprot and SP-TrEMBL
- 53% of typical genome sequence

Example PFAM Entry

- Literature Links
- Protein Structure
- Domain Architectures
- GO Functional Categories

HMMER

- Implementation of Profile HMM methods
- Given a multiple alignment, HMMER can build a Profile HMM
- Given a Profile HMM (e.g. from PFAM), HMMER can score sequences for membership in the family or domain

HMMs in Context

- HMMs
- Sequence alignment
- Gene Prediction
- Generalized HMMs
- Variable length states
- Complex emissions models
- e.g. Genscan
- Bayesian Networks
- General graphical model
- Arbitrary graph structure
- e.g. Regulatory network analysis

References

- Sean R Eddy, Hidden Markov models, Current Opinion in Structural Biology, 6:361-365, 1996.
- Sean R Eddy, Profile hidden Markov models, Bioinformatics, 14(9):755-763, 1998.
- Anders Krogh, An introduction to hidden Markov models for biological sequences, in Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls and S. Kasif, pp. 45-63, Elsevier, 1998.
- HMMER: profile HMMs for protein sequence analysis. http://hmmer.wustl.edu/
- Erik L. L. Sonnhammer et al, Pfam: multiple sequence alignments and HMM-profiles of protein domains, Nucleic Acids Research, 26(1):320-322, 1998.
- R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.

Tomorrow's Lab

- Basic Sequence Analysis Tools
- Argo Genome Browser
- Blast
- Gene prediction using Glimmer
- Protein families with HMMER and PFAM
- Comparative synteny analysis
- Identify virulence factors by annotating and comparing virulent and avirulent bacterial sequences


The Hidden in HMM

- DNA does not come conveniently labeled (e.g. Pathogenicity Island, Gene, Promoter)
- All we observe are the nucleotide sequences
- The hidden in HMM refers to the fact that the state labels, L, are not observed
- Only observe emissions (e.g. nucleotide sequence in our example)

Relation between Viterbi and Forward

- VITERBI
- $V_j(i) = P(\text{most probable path ending in state } j \text{ with observation } i)$
- Initialization
- $V_0(0) = 1$
- $V_k(0) = 0$, for all $k > 0$
- Iteration
- $V_j(i) = e_j(x_i) \max_k V_k(i-1)\, a_{kj}$
- Termination
- $P(x, \pi^*) = \max_k V_k(N)$

- FORWARD
- $f_l(i) = P(x_1 \ldots x_i, \text{state}_i = l)$
- Initialization
- $f_0(0) = 1$
- $f_k(0) = 0$, for all $k > 0$
- Iteration
- $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
- Termination
- $P(x) = \sum_k f_k(N)\, a_{k0}$

Slide Credit Serafim Batzoglou
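
The two recursions differ only in whether the previous column is combined with a maximum or a sum, which the sketch below makes explicit. It is a simplification of the slide: an assumed initial distribution `START` replaces the begin state 0, there is no end transition a_k0, and the emission frequencies are hypothetical.

```python
STATES = "BP"
START = {"B": 0.5, "P": 0.5}  # assumed; stands in for the begin state
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {  # hypothetical emission frequencies
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "P": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def recurse(seq, combine):
    """Shared recursion: combine=max is Viterbi, combine=sum is Forward."""
    v = {k: START[k] * EMIT[k][seq[0]] for k in STATES}
    for base in seq[1:]:
        v = {j: EMIT[j][base] * combine([v[k] * TRANS[k][j] for k in STATES])
             for j in STATES}
    return combine(list(v.values()))

seq = "GCAAATGC"
best_path_prob = recurse(seq, max)  # P(x, best path)
total_prob = recurse(seq, sum)      # P(x), summed over all paths
print(best_path_prob, total_prob)
```

The best single path can never account for more probability than all paths together, so the Viterbi value is always at most the Forward value.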