Welcome to Introduction to Computational Genomics for Infectious Disease - PowerPoint PPT Presentation


Provided by: jamesg68


Transcript and Presenter's Notes

1
Welcome to Introduction to Computational Genomics
for Infectious Disease
2
Course Instructors

Instructor
James Galagan
Teaching Assistants
Lab Instructors

Brian Weiner Desmond Lun
Antonis Rokas Mark Borowsky Jeremy Zucker
Reinhard Engels Aaron Brandes Caroline Colijn
Other members of Broad Microbial Analysis Group
3
Schedule and Logistics

Lectures
Tues/Thurs 11-12:30
Harvard School of Public Health, FXB-301 (The François-Xavier Bagnoud Center, Room 301)

Labs
Wed/Fri 1-3
Broad Institute, Olympus Room (first floor of Broad Main Lobby; see front desk attendant near entrance)
Individual computers and software provided
No programming experience required
4
Website
www.broad.mit.edu/annotation/winter_course_2006/

Contact information
Directions to Broad
Lecture slides
Lab handouts
Resources

5
Goals of Course

Introduction to concepts behind commonly used
computational tools
Recognize connection between different concepts
and applications
Hands on experience with computational analysis

6
Concepts and Applications

Lectures will cover concepts
Computationally oriented
Labs will provide opportunity for hands on
application of tools
Nuts and bolts of running tools
Application of tools not covered in lectures

7
Computational Genomics Overview
Slide Credit: Manolis Kellis
8
Topics

Probabilistic Sequence Modeling
Clustering and Classification
Motifs
Steady State Metabolic Modeling

9
Topics Not Covered

Sequence Alignment
Phylogeny (maybe in labs)
Molecular Evolution
Population Genetics
Advanced Machine Learning
Bayesian Networks
Conditional Random Fields

10
Applications to Infectious Disease

Examples and labs will focus on the analysis of
microbial genomics data
Pathogenicity islands
TB expression analysis
Antigen prediction
Mycolic acid metabolism
But approaches are applicable to any organism and
to many different questions

11
Probabilistic Modeling of Biological Sequences

Concepts
Statistical Modeling of Sequences
Hidden Markov Models
Applications
Predicting pathogenicity islands
Modeling protein families
Lab Practical
Basic sequence annotation

12
Probabilistic Sequence Modeling

Treat objects of interest as random variables
nucleotides, amino acids, genes, etc.
Model probability distributions for these
variables
Use probability calculus to make inferences

13
Why Probabilistic Sequence Modeling?

Biological data is noisy
Probability provides a calculus for manipulating
models
Not limited to yes/no answers; can provide
degrees of belief
Many common computational tools based on
probabilistic models

14
Sequence Annotation
GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATG
AAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAA
TGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCA
GGGCCGAGTATAATTGGTCGCGGTCGCCTCGAAGTTAGCTTATGCAATGC
AGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGCGTAGGTGA
AGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTT
CC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGG
TCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGT
AAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGA
CCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGC
CGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACC
ACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGG
CTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCT
GACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGC
CGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCG
CCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTAC
CGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAA
T GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTG
GGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATC
TGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACT
GGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGC
GAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCG
AGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGG
CGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCG
CGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCT
GCACGAGGCCGAACTCCAACTCGCCGAG
15
Sequence Annotation
(Same sequence as on slide 14, now with one region annotated:)
Gene
16
Sequence Annotation
(Same sequence as on slide 14, now with three regions annotated:)
Promoter Motif
Gene
Kinase Domain
17
Probabilistic Sequence Modeling

Hidden Markov Models (HMM)
A general framework for sequences of symbols
(e.g. nucleotides, amino acids)
Widely used in computational genomics
HMMER: HMMs for protein families
Pathogenicity Islands

18
Pathogenicity Islands

Clusters of genes acquired by horizontal transfer
Present in pathogenic species but not others
Frequently encode virulence factors
Toxins, secondary metabolites, adhesins

(Flanked by repeats, gene content, phylogeny,
regulation, codon usage)
Different GC content than rest of genome

19
Application: Bacillus subtilis
20
Modeling Sequence Composition

Calculate sequence distribution from known
islands
Count occurrences of A,T,G,C
Model islands as nucleotides drawn independently
from this distribution

$P(S_i \mid M_P)$
21
The Probability of a Sequence

Can calculate the probability of a particular
sequence (S) according to the pathogenicity
island model ($M_P$)

Example:
S = AAATGCGCATTTCGAA
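
The calculation on this slide can be sketched in Python as follows. Under the independence model, the probability of S is the product of the per-nucleotide probabilities, which is safer to compute as a sum of logs. The frequencies in `ISLAND_FREQS` are hypothetical placeholders, since the transcript does not preserve the values used in the lecture.

```python
import math

# Hypothetical island composition (assumption; not taken from the slides).
ISLAND_FREQS = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}

def sequence_log_prob(seq, freqs):
    """Return log P(S | M) when each nucleotide is drawn independently from freqs."""
    return sum(math.log(freqs[base]) for base in seq)

S = "AAATGCGCATTTCGAA"
log_p = sequence_log_prob(S, ISLAND_FREQS)
print(f"log P(S | M_P) = {log_p:.3f}")
```

Working in log space avoids numerical underflow once sequences are more than a few hundred nucleotides long.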
22
Sequence Classification

PROBLEM: Given a sequence, is it an island?
We can calculate $P(S \mid M_P)$, but what is a
sufficient value of P?
SOLUTION: compare to a null model and calculate
the log-likelihood ratio
e.g. background DNA distribution model, B

Pathogenicity Islands
Background DNA
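
A minimal sketch of the log-likelihood ratio score, log [P(S | M_P) / P(S | B)], assuming independent nucleotides in both models. The island frequencies and the uniform background are illustrative assumptions, not values from the course.

```python
import math

# Both compositions are hypothetical (assumptions for illustration only).
ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}
BACKGROUND = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

def log_likelihood_ratio(seq, model, null):
    """Sum of per-nucleotide log odds; positive values favor `model` over `null`."""
    return sum(math.log(model[base] / null[base]) for base in seq)

gc_rich_score = log_likelihood_ratio("GCGCGGCC", ISLAND, BACKGROUND)
at_rich_score = log_likelihood_ratio("ATATTAAT", ISLAND, BACKGROUND)
print(gc_rich_score, at_rich_score)
```

A score above zero means the sequence is more probable under the island model than under the background model; the threshold actually used for classification is a separate choice.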
23
Finding Islands in Sequences

Could use the log-likelihood ratio on windows of
fixed size
What if islands have variable length?
We prefer a model for entire sequence

TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG
GCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA
TATTGGC
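
The fixed-window idea can be sketched as below: score every window of a chosen width with the log-likelihood ratio. The window width and both nucleotide distributions are assumptions made for illustration; the sequence is the one shown on this slide with the spaces removed.

```python
import math

# Hypothetical compositions and window width (assumptions).
ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}
BACKGROUND = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
WIDTH = 20

SEQ = ("TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG"
       "GCAGACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA"
       "TATTGGC")

def window_scores(seq, width):
    """Return (start, log-likelihood ratio) for every window of the given width."""
    results = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        llr = sum(math.log(ISLAND[b] / BACKGROUND[b]) for b in window)
        results.append((start, llr))
    return results

scores = window_scores(SEQ, WIDTH)
best_start, best_llr = max(scores, key=lambda item: item[1])
print(best_start, round(best_llr, 3))
```

The weakness raised on the slide is visible here: `WIDTH` has to be chosen in advance, so islands much shorter or longer than the window are scored poorly.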
24
A More Complex Model
Background
Island
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG
GCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA
TATTGGC
25
A Generative Model
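(Figure: a chain of hidden labels, each either P (pathogenicity island) or B (background), emitting the observed sequence S)
Emission probabilities: $P(S \mid P)$ and $P(S \mid B)$
Transition probabilities $P(L_{i+1} \mid L_i)$:
          $B_{i+1}$   $P_{i+1}$
$B_i$     0.85        0.15
$P_i$     0.25        0.75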
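
To make the word "generative" concrete, here is a small sketch that samples a label path and a sequence from the two-state model. The transition probabilities are the ones in the table above; the emission probabilities and the choice to start in state B are assumptions, since the transcript does not give them.

```python
import random

# Transition probabilities from the slide.
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
# Hypothetical emission probabilities (assumption).
EMIT = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42},
}

def pick(dist, rng):
    """Draw one key from a {value: probability} dictionary."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def generate(length, rng, start="B"):
    """Sample (labels, sequence); starting in B is an assumption."""
    labels, bases = [], []
    state = start
    for _ in range(length):
        labels.append(state)
        bases.append(pick(EMIT[state], rng))   # emit a nucleotide from the current state
        state = pick(TRANS[state], rng)        # then move to the next state
    return "".join(labels), "".join(bases)

labels, seq = generate(40, random.Random(0))
print(labels)
print(seq)
```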
26
A Hidden Markov Model
Hidden states: $L = 1, \ldots, K$
Transition probabilities: $a_{ij}$ = transition probability from state i to state j
Emission probabilities: $e_i(b) = P(\text{emitting } b \mid \text{state} = i)$
Initial state probability: $p(b) = P(\text{first state} = b)$
(Figure: State i and State j linked by transition probabilities, with emission probabilities $e_i(b)$ and $e_j(b)$)
27
What can we do with this model?

The model defines a joint probability over labels
and sequences, P(L,S)
Implicit in model is what labels tend to go
with what sequences (and vice versa)
Rules of probability allow us to use this model
to analyze existing sequences

28
Fundamental HMM Operations
Computation
Biology

Decoding
Given an HMM and sequence S
Find a corresponding sequence of labels, L
Evaluation
Given an HMM and sequence S
Find $P(S \mid \text{HMM})$
Training
Given an HMM w/o parameters and set of
sequences S
Find transition and emission probabilities that
maximize $P(S \mid \text{params}, \text{HMM})$

29
The Hidden in HMM

DNA does not come conveniently labeled (e.g.
Island, Gene, Promoter)
We observe nucleotide sequences
The "hidden" in HMM refers to the fact that state
labels, L, are not observed
Only observe emissions (e.g. nucleotide sequence
in our example)

(Figure: State i and State j emitting the sequence A A G T T A G A G)
30
Decoding With HMM

Given observables, we would like to predict a
sequence of hidden states that is most likely to
have generated that sequence

Pathogenicity Island Example
Given a nucleotide sequence, we want a labeling
of each nucleotide as either pathogenicity
island or background DNA
31
The Most Likely Path

Given a sequence, one reasonable choice for a
labeling is

$L^* = \arg\max_L P(L, S \mid \text{HMM})$

The sequence of labels, L, (or path) that makes
the labels and sequence most likely given the
model
32
Probability of a Path,Seq
(Figure: trellis of hidden labels L, each P or B, over the observed sequence S = G C A A A T G C, with one path highlighted)
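
The probability of one particular path together with the sequence is a product of one initial term, one emission per position, and one transition per step. A minimal sketch, using the transition table from the generative-model slide; the emission values and the 0.5/0.5 initial distribution are assumptions.

```python
import math

INIT = {"B": 0.5, "P": 0.5}  # assumed initial state distribution
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {  # hypothetical emission probabilities
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42},
}

def joint_log_prob(labels, seq):
    """Return log P(L, S) for one label path and one sequence of equal length."""
    logp = math.log(INIT[labels[0]]) + math.log(EMIT[labels[0]][seq[0]])
    for prev, cur, base in zip(labels, labels[1:], seq[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][base])
    return logp

S = "GCAAATGC"
all_background = joint_log_prob("BBBBBBBB", S)
all_island = joint_log_prob("PPPPPPPP", S)
print(all_background, all_island)
```

With two states there are 2^N possible paths for a sequence of length N, which is why scoring every path this way does not scale.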
33
Probability of a Path,Seq
(Figure: the same trellis over S = G C A A A T G C, showing the many possible paths through P and B)
We could try to calculate the probability of
every path, but...
34
Decoding

Viterbi Algorithm
Finds most likely sequence of labels, L, given
sequence and model
Uses dynamic programming (same technique used in
sequence alignment)
Much more efficient than searching every path
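
A compact sketch of the Viterbi recursion for the two-state island model, written in log space. The transition probabilities come from the generative-model slide; the emission probabilities and the initial distribution are assumptions for illustration.

```python
import math

STATES = ("B", "P")
INIT = {"B": 0.5, "P": 0.5}  # assumed initial state distribution
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {  # hypothetical emission probabilities
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42},
}

def viterbi(seq):
    """Return (most likely label path, its joint log probability)."""
    # best[i][s]: log probability of the best path that ends in state s at position i
    best = [{s: math.log(INIT[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for base in seq[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: best[-1][r] + math.log(TRANS[r][s]))
            pointers[s] = prev
            scores[s] = (best[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][base]))
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path)), max(best[-1].values())

path, log_prob = viterbi("GCAAATGC")
print(path, round(log_prob, 3))
```

The work grows as N times K squared for N positions and K states, compared with K to the power N for enumerating every path.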

35
Probability of a Single Label
(Figure: trellis of hidden labels L, each P or B, over the observed sequence S = G C A A A T G C)
$P(\text{Label}_5 = B \mid S)$
Forward algorithm (dynamic programming)

Calculate most probable label, $L_i$, at each
position i
Doing this for all N positions gives us
$L_1, L_2, L_3, \ldots, L_N$

36
Two Decoding Options

Viterbi Algorithm
Finds most likely sequence of labels, L, given
sequence and model
Posterior Decoding
Finds most likely label at each position for all
positions, given sequence and model
$L_1, L_2, L_3, \ldots, L_N$
Forward and Backward equations
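
Posterior decoding can be sketched with the forward and backward recursions as below. The transition probabilities are from the generative-model slide; the emissions and initial distribution are assumptions. This version uses raw probabilities, which is fine for short sequences but would underflow on long ones without scaling or log-space arithmetic.

```python
STATES = ("B", "P")
INIT = {"B": 0.5, "P": 0.5}  # assumed initial state distribution
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {  # hypothetical emission probabilities
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42},
}

def forward(seq):
    """f[i][s] = P(first i + 1 symbols, state at position i is s)."""
    f = [{s: INIT[s] * EMIT[s][seq[0]] for s in STATES}]
    for base in seq[1:]:
        f.append({s: EMIT[s][base] * sum(f[-1][r] * TRANS[r][s] for r in STATES)
                  for s in STATES})
    return f

def backward(seq):
    """b[i][r] = P(symbols after position i | state at position i is r)."""
    b = [{s: 1.0 for s in STATES}]
    for base in reversed(seq[1:]):
        b.insert(0, {r: sum(TRANS[r][s] * EMIT[s][base] * b[0][s] for s in STATES)
                     for r in STATES})
    return b

def posterior_decode(seq):
    """Return (per-position most probable labels, per-position posteriors)."""
    f, b = forward(seq), backward(seq)
    p_seq = sum(f[-1].values())  # P(S | HMM), the evaluation quantity
    posteriors = [{s: f[i][s] * b[i][s] / p_seq for s in STATES}
                  for i in range(len(seq))]
    labels = "".join(max(STATES, key=col.get) for col in posteriors)
    return labels, posteriors

labels, posteriors = posterior_decode("GCAAATGC")
print(labels)
```

Unlike the Viterbi path, the per-position labels returned here need not form the single most likely path, but each one comes with a probability that can be plotted along the sequence.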

37
Application: Bacillus subtilis
38
Method
Three State Model
Gene+
Gene-
AT Rich

Second Order Emissions: $P(S_i) = P(S_i \mid \text{State}, S_{i-1}, S_{i-2})$ (capturing trinucleotide frequencies)
Train using EM
Predict w/ Posterior Decoding
Nicolas et al (2002) NAR
39
Results
Gene on positive strand
Gene on negative strand

A/T Rich
Intergenic regions
Islands

Each line is $P(\text{label} \mid S, \text{model})$, color coded by
label
Nicolas et al (2002) NAR
40
Fundamental HMM Operations
Computation / Biology

Decoding
Given an HMM and sequence S
Find a corresponding sequence of labels, L
Biology: Annotate pathogenicity islands on a new sequence
Evaluation
Given an HMM and sequence S
Find $P(S \mid \text{HMM})$
Biology: Score a particular sequence (not as useful for this model; will come back to this later)
Training
Given an HMM w/o parameters and set of
sequences S
Find transition and emission probabilities that
maximize $P(S \mid \text{params}, \text{HMM})$
Biology: Learn a model for sequence composed of background DNA and pathogenicity islands
41
Training an HMM
Transition probabilities
e.g. $P(P_{i+1} \mid B_i)$, the probability of entering a pathogenicity island from background DNA
Emission probabilities
i.e. the nucleotide frequencies for background DNA and pathogenicity islands
(Figure: two states, B and P, with transitions $P(L_{i+1} \mid L_i)$ and emissions $P(S \mid P)$ and $P(S \mid B)$)
42
Learning From Labelled Data
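Maximum Likelihood Estimation
If we have a sequence that has islands marked, we
can simply count
(Figure: a labeled path L through states P and B over the observed sequence S = G C A A A T G C)
Emission probabilities $P(S \mid P)$ and $P(S \mid B)$ over A, T, G, C; the row shown on the slide is A 1/5, T 0, G 2/5, C 2/5
Transition probabilities $P(L_{i+1} \mid L_i)$:
          $B_{i+1}$   $P_{i+1}$   End
$B_i$     3/5         1/5         1/5
$P_i$     1/3         2/3         0
Start     1           0           0
etc.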
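
Counting from labeled data can be sketched as follows. The labeling `BBBPPPBB` is an assumption: the transcript does not preserve which positions were marked as island, but this labeling of GCAAATGC is consistent with the fractions listed on the slide. Exact fractions are used so the result can be compared with those values directly.

```python
from collections import Counter
from fractions import Fraction

def estimate(labels, seq):
    """Maximum likelihood estimates of transitions and emissions by counting."""
    path = ["Start", *labels, "End"]
    trans_counts = Counter(zip(path, path[1:]))   # (from, to) pairs
    emit_counts = Counter(zip(labels, seq))       # (state, nucleotide) pairs
    from_totals = Counter(path[:-1])              # how often each state is left
    state_totals = Counter(labels)                # how often each state emits
    trans = {(a, b): Fraction(n, from_totals[a]) for (a, b), n in trans_counts.items()}
    emit = {(s, x): Fraction(n, state_totals[s]) for (s, x), n in emit_counts.items()}
    return trans, emit

# Assumed labeling, chosen to agree with the counts shown on the slide.
trans, emit = estimate("BBBPPPBB", "GCAAATGC")
for key in sorted(trans):
    print(key, trans[key])
```

Pairs that never occur, such as background emitting T in this toy example, get probability zero; in practice pseudocounts are commonly added to avoid that.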
43
Unlabelled Data
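How do we know how to count?
(Figure: a path from start to End through unknown labels L, each P or B, over the observed sequence S = G C A A A T G C. The emission tables $P(S \mid P)$ and $P(S \mid B)$ and the transition table $P(L_{i+1} \mid L_i)$, with rows $B_i$, $P_i$, Start and columns $B_{i+1}$, $P_{i+1}$, End, are all unknown.)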
44
Unlabeled Data
(Figure: a path from start to End through unknown labels L, each P or B, over the observed sequence S = G C A A A T G C)

An idea
Imagine we start with some parameters
We could calculate the most likely path, P,
given those parameters and S
We could then use P to update our parameters by
maximum likelihood
And iterate (to convergence)

45
Expectation Maximization (EM)

Initialize parameters
E Step: Estimate probability of hidden labels, Q,
given parameters and sequence
M Step: Choose new parameters to maximize expected
likelihood of parameters given Q
Iterate

$P(S \mid \text{Model})$ is guaranteed not to decrease at each iteration
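
For HMMs, the standard form of EM is the Baum-Welch algorithm. The sketch below is one plain-Python version for the two-state island model, not necessarily the implementation used in the course. The starting emission values and initial distribution are assumptions; the starting transitions are from the generative-model slide, and the training sequence is the example sequence from the island-finding slide.

```python
import math

STATES = ("B", "P")
ALPHABET = "ATGC"
SEQ = ("TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG"
       "GCAGACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA"
       "TATTGGC")

def forward_backward(seq, init, trans, emit):
    """Scaled forward and backward tables plus log P(S | current parameters)."""
    f, scales = [], []
    for i, x in enumerate(seq):
        if i == 0:
            row = {s: init[s] * emit[s][x] for s in STATES}
        else:
            row = {s: emit[s][x] * sum(f[-1][r] * trans[r][s] for r in STATES)
                   for s in STATES}
        c = sum(row.values())
        f.append({s: row[s] / c for s in STATES})
        scales.append(c)
    b = [{s: 1.0 for s in STATES}]
    for i in range(len(seq) - 1, 0, -1):
        b.insert(0, {r: sum(trans[r][s] * emit[s][seq[i]] * b[0][s] for s in STATES)
                     / scales[i] for r in STATES})
    return f, b, scales, sum(math.log(c) for c in scales)

def em_step(seq, init, trans, emit):
    """One E step and one M step; also returns the log-likelihood before the update."""
    f, b, scales, loglik = forward_backward(seq, init, trans, emit)
    n = len(seq)
    # E step: posterior state occupancies (gamma) and expected transition counts (xi).
    gamma = [{s: f[i][s] * b[i][s] for s in STATES} for i in range(n)]
    new_trans, new_emit = {}, {}
    for r in STATES:
        xi = {s: sum(f[i][r] * trans[r][s] * emit[s][seq[i + 1]] * b[i + 1][s]
                     / scales[i + 1] for i in range(n - 1)) for s in STATES}
        # M step: renormalize expected counts into probabilities.
        total = sum(xi.values())
        new_trans[r] = {s: xi[s] / total for s in STATES}
        occupancy = sum(g[r] for g in gamma)
        new_emit[r] = {x: sum(g[r] for g, y in zip(gamma, seq) if y == x) / occupancy
                       for x in ALPHABET}
    return dict(gamma[0]), new_trans, new_emit, loglik

init = {"B": 0.5, "P": 0.5}  # assumed
trans = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
emit = {"B": {x: 0.25 for x in ALPHABET},  # assumed starting emissions
        "P": {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}}
history = []
for _ in range(10):
    init, trans, emit, loglik = em_step(SEQ, init, trans, emit)
    history.append(loglik)
print([round(v, 3) for v in history])
```

The printed log-likelihoods should never go down from one iteration to the next, which is the guarantee stated on the slide. EM only finds a local optimum, so the result depends on the starting parameters.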
46
Expectation Maximization (EM)

Remember the basic idea!
Use model to estimate (distribution of) missing
data
Use estimate to update model
Repeat until convergence
EM is a general approach for learning models (ML
estimation) when there is missing data
Widely used in computational biology

EM frequently used in motif discovery
Lecture 3

47
A More Sophisticated Application
Modeling Protein Families

Given amino acid sequences from a protein family,
how can we find other members?
Can search databases with each known member, but
this is not sensitive
More information is contained in full set
The HMM Profile Approach
Learn the statistical features of protein family
Model these features with an HMM
Search for new members by scoring with HMM

We will learn features from multiple alignments
48
Human Ubiquitin Conjugating Enzymes

UBE2D2   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1   FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1   FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9    FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N    LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10   FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34    FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1   FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B    FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I    FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5   LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1   FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6   FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H    LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12    VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS
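
As a rough illustration of learning family features from an alignment, the sketch below estimates per-column amino acid probabilities and scores a gap-free sequence against them. This is a deliberate simplification: it models only match-state emissions and ignores the insert and delete states and the transitions that a real profile HMM includes. The add-one pseudocount and the uniform background are assumptions. The four rows are the first ten columns of UBE2D2, UBE2E1, UBE2N and AAF67016 from the alignment above.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALIGNMENT = [
    "FPTDYPFKPP",  # UBE2D2
    "FTPEYPFKPP",  # UBE2E1
    "LPEEYPMAAP",  # UBE2N
    "IPERYPFEPP",  # AAF67016
]

def column_emissions(alignment, pseudocount=1.0):
    """Per-column amino acid probabilities with add-pseudocount smoothing."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

def log_odds_score(seq, profile):
    """Log odds of seq under the column model versus a uniform background (assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    return sum(math.log(col[res] / background) for res, col in zip(seq, profile))

profile = column_emissions(ALIGNMENT)
member_score = log_odds_score("FPTDYPFKPP", profile)
unrelated_score = log_odds_score("GGGGGGGGGG", profile)
print(round(member_score, 3), round(unrelated_score, 3))
```

Highly conserved columns, such as the Y and P positions here, contribute the most to the score, which is the extra information a profile has over any single family member.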

49
Profile HMM
50
Using Profile HMMs
Computation / Biology

Decoding
Find sequence of labels, L, that maximizes
$P(L \mid S, \text{HMM})$
Biology: Align a new sequence to a protein family
Evaluation
Find $P(S \mid \text{HMM})$
Biology: Score a sequence for membership in family
Training
Find transition and emission probabilities that
maximize $P(S \mid \text{params}, \text{HMM})$
Biology: Discover and model family structure
51
Example: Modeling Globins

Profile HMM from 300 randomly selected globin
genes
Score database of 60,000 proteins

52
PFAM: Collection of Profile HMMs
http://www.sanger.ac.uk/Software/Pfam/
53
PFAM Resources

8957 curated protein families and domains
Each with HMM profile(s)
Coverage
73% of proteins in Swiss-Prot and SP-TrEMBL
53% of typical genome sequence

54
Example PFAM Entry

Literature Links
Protein Structure
Domain Architectures
GO Functional Categories

55
HMMER

Implementation of Profile HMM methods
Given a multiple alignment, HMMER can build a
Profile HMM
Given a Profile HMM (e.g. from PFAM), HMMER can
score sequences for membership in the family or
domain

56
HMMs in Context

HMMs
Sequence alignment
Gene Prediction
Generalized HMMs
Variable length states
Complex emissions models
e.g. Genscan
Bayesian Networks
General graphical model
Arbitrary graph structure
e.g. Regulatory network analysis

57
References

Sean R Eddy, Hidden Markov models, Current
Opinion in Structural Biology, 6:361-365, 1996.
Sean R Eddy, Profile hidden Markov models,
Bioinformatics, 14(9):755-763, 1998.
Anders Krogh, An introduction to hidden Markov
models for biological sequences, In
Computational Methods in Molecular Biology,
edited by S. L. Salzberg, D. B. Searls and S.
Kasif, pp. 45-63, Elsevier, 1998.
HMMER: profile HMMs for protein sequence
analysis. http://hmmer.wustl.edu/
Erik L. L. Sonnhammer et al, Pfam: multiple
sequence alignments and HMM-profiles of protein
domains, Nucleic Acids Research, 26(1):320-322,
1998.
R. Durbin, S. Eddy, A. Krogh and G. Mitchison,
Biological Sequence Analysis, Cambridge
University Press, 1998.

58
Tomorrow's Lab

Basic Sequence Analysis Tools
Argo Genome Browser
BLAST
Gene prediction using Glimmer
Protein families with HMMER and PFAM
Comparative synteny analysis
Identify virulence factors by annotating and
comparing virulent and avirulent bacterial
sequences

60
The Hidden in HMM

DNA does not come conveniently labeled (e.g.
Pathogenicity Island, Gene, Promoter)
All we observe are the nucleotide sequences
The "hidden" in HMM refers to the fact that the
state labels, L, are not observed
Only observe emissions (e.g. nucleotide sequence
in our example)

61
Relation between Viterbi and Forward