Welcome to Introduction to Computational Genomics for Infectious Disease

Provided by: jamesg68
1
Welcome to Introduction to Computational Genomics
for Infectious Disease
2
Course Instructors
  • Instructor
  • James Galagan
  • Teaching Assistants
  • Lab Instructors

Brian Weiner, Desmond Lun
Antonis Rokas, Mark Borowsky, Jeremy Zucker
Reinhard Engels, Aaron Brandes, Caroline Colijn
Other members of Broad Microbial Analysis Group
3
Schedule and Logistics
  • Lectures
  • Labs

Lectures: Tues/Thurs 11-12:30, Harvard School of Public Health, FXB-301 (The François-Xavier Bagnoud Center, Room 301)
Labs: Wed/Fri 1-3, Broad Institute, Olympus Room (first floor of Broad Main Lobby; see front desk attendant near entrance). Individual computers and software provided. No programming experience required.
4
Website
www.broad.mit.edu/annotation/winter_course_2006/
  • Contact information
  • Directions to Broad
  • Lecture slides
  • Lab handouts
  • Resources

5
Goals of Course
  • Introduction to concepts behind commonly used
    computational tools
  • Recognize connection between different concepts
    and applications
  • Hands on experience with computational analysis

6
Concepts and Applications
  • Lectures will cover concepts
  • Computationally oriented
  • Labs will provide opportunity for hands on
    application of tools
  • Nuts and bolts of running tools
  • Application of tools not covered in lectures

7
Computational Genomics Overview
Slide Credit: Manolis Kellis
8
Topics
  1. Probabilistic Sequence Modeling
  2. Clustering and Classification
  3. Motifs
  4. Steady State Metabolic Modeling

9
Topics Not Covered
  • Sequence Alignment
  • Phylogeny (maybe in labs)
  • Molecular Evolution
  • Population Genetics
  • Advanced Machine Learning
  • Bayesian Networks
  • Conditional Random Fields

10
Applications to Infectious Disease
  • Examples and labs will focus on the analysis of
    microbial genomics data
  • Pathogenicity islands
  • TB expression analysis
  • Antigen prediction
  • Mycolic acid metabolism
  • But approaches are applicable to any organism and
    to many different questions

11
Probabilistic Modeling of Biological Sequences
  • Concepts
  • Statistical Modeling of Sequences
  • Hidden Markov Models
  • Applications
  • Predicting pathogenicity islands
  • Modeling protein families
  • Lab Practical
  • Basic sequence annotation

12
Probabilistic Sequence Modeling
  • Treat objects of interest as random variables
  • nucleotides, amino acids, genes, etc.
  • Model probability distributions for these
    variables
  • Use probability calculus to make inferences

13
Why Probabilistic Sequence Modeling?
  • Biological data is noisy
  • Probability provides a calculus for manipulating
    models
  • Not limited to yes/no answers; can provide
    degrees of belief
  • Many common computational tools based on
    probabilistic models

14
Sequence Annotation
GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATG
AAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAA
TGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCA
GGGCCGAGTATAATTGGTCGCGGTCGCCTCGAAGTTAGCTTATGCAATGC
AGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGCGTAGGTGA
AGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTT
CC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGG
TCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGT
AAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGA
CCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGC
CGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACC
ACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGG
CTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCT
GACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGC
CGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCG
CCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTAC
CGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAA
T GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTG
GGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATC
TGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACT
GGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGC
GAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCG
AGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGG
CGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCG
CGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCT
GCACGAGGCCGAACTCCAACTCGCCGAG
15
Sequence Annotation
[Same sequence as on slide 14, annotated with the label: Gene]
16
Sequence Annotation
[Same sequence as on slide 14, annotated with the labels: Promoter Motif, Gene, Kinase Domain]
17
Probabilistic Sequence Modeling
  • Hidden Markov Models (HMM)
  • A general framework for sequences of symbols
    (e.g. nucleotides, amino acids)
  • Widely used in computational genomics
  • HMMER: HMMs for protein families
  • Pathogenicity Islands

18
Pathogenicity Islands
  • Clusters of genes acquired by horizontal transfer
  • Present in pathogenic species but not others
  • Frequently encode virulence factors
  • Toxins, secondary metabolites, adhesins
  • (Flanked by repeats, gene content, phylogeny,
    regulation, codon usage)
  • Different GC content than rest of genome

19
Application: Bacillus subtilis
20
Modeling Sequence Composition
  • Calculate sequence distribution from known
    islands
  • Count occurrences of A,T,G,C
  • Model islands as nucleotides drawn independently
    from this distribution



$P(S_i \mid M_P)$
21
The Probability of a Sequence
  • Can calculate the probability of a particular
    sequence (S) according to the pathogenicity
    island model (MP)

Example: $S$ = AAATGCGCATTTCGAA
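As a minimal sketch of this calculation: if nucleotides are drawn independently, the probability of the sequence is the product of the per-nucleotide probabilities, $P(S \mid M_P) = \prod_i P(S_i \mid M_P)$. The island frequencies below are hypothetical values chosen for illustration (an AT-rich island); in practice they would be counted from known islands.

```python
import math

# Hypothetical island nucleotide frequencies (illustrative, not from the slides).
ISLAND_FREQS = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

def log_prob_independent(seq, freqs):
    """Return log P(seq | model), treating each nucleotide as an independent draw."""
    return sum(math.log(freqs[base]) for base in seq)

S = "AAATGCGCATTTCGAA"  # example sequence from the slide
log_p = log_prob_independent(S, ISLAND_FREQS)
print(f"log P(S | M_P) = {log_p:.3f}")
print(f"P(S | M_P)     = {math.exp(log_p):.3e}")
```

Working in log space avoids numerical underflow, since the product of many small probabilities shrinks quickly as the sequence gets longer.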
22
Sequence Classification
  • PROBLEM: Given a sequence, is it an island?
  • We can calculate $P(S \mid M_P)$, but what value of P is sufficient?
  • SOLUTION: compare to a null model and calculate the log-likelihood ratio
  • e.g. background DNA distribution model, B

Pathogenicity Islands
Background DNA
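A sketch of the log-likelihood ratio test, $\log \frac{P(S \mid M_P)}{P(S \mid B)}$, under independent-nucleotide models. Both frequency tables and the threshold of 0 are assumptions made for illustration only.

```python
import math

# Hypothetical frequencies: an AT-rich island model and a uniform background.
ISLAND_FREQS = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
BACKGROUND_FREQS = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

def log_likelihood_ratio(seq, model=ISLAND_FREQS, null=BACKGROUND_FREQS):
    """Return log[P(seq | model) / P(seq | null)] for independent nucleotides."""
    return sum(math.log(model[b] / null[b]) for b in seq)

def is_island(seq, threshold=0.0):
    """Classify as island when the island model explains seq better than the null."""
    return log_likelihood_ratio(seq) > threshold

for seq in ("AAATGCGCATTTCGAA", "ATTATAAT", "GCGCGGCC"):
    print(seq, round(log_likelihood_ratio(seq), 3), is_island(seq))
```

A positive score means the island model assigns the sequence a higher probability than the background model; raising the threshold trades sensitivity for specificity.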
23
Finding Islands in Sequences
  • Could use the log-likelihood ratio on windows of
    fixed size
  • What if islands have variable length?
  • We prefer a model for entire sequence

TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG
GCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA
TATTGGC
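The fixed-window idea can be sketched as follows, scoring every window of the example sequence with the log-likelihood ratio. The window size of 20 and the frequency tables are hypothetical choices.

```python
import math

# Hypothetical frequencies: an AT-rich island model and a uniform background.
ISLAND_FREQS = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
BACKGROUND_FREQS = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

# Example sequence from the slide, with line breaks and the stray space removed.
SEQ = ("TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG"
       "GCAGACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA"
       "TATTGGC")

def window_llr(seq, size, step=1):
    """Yield (start, log-likelihood ratio) for each window of `size` bases."""
    for start in range(0, len(seq) - size + 1, step):
        window = seq[start:start + size]
        score = sum(math.log(ISLAND_FREQS[b] / BACKGROUND_FREQS[b]) for b in window)
        yield start, score

scores = list(window_llr(SEQ, size=20))
best_start, best_score = max(scores, key=lambda pair: pair[1])
print(f"{len(scores)} windows; best window starts at {best_start} "
      f"with score {best_score:.3f}")
```

The limitation raised on the slide is visible here: the window size has to be fixed in advance, so islands much shorter or longer than the window are scored poorly.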
24
A More Complex Model
Background
Island
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG
GCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA
TATTGGC
25
A Generative Model
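[Figure: at each position a hidden label, P (pathogenicity island) or B (background), emits the observed nucleotide of the sequence S]
Emission probabilities: $P(S \mid P)$, $P(S \mid B)$
Transition probabilities $P(L_{i+1} \mid L_i)$:
          $B_{i+1}$   $P_{i+1}$
$B_i$     0.85        0.15
$P_i$     0.25        0.75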
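To illustrate what "generative" means here, the sketch below samples a label path and a nucleotide sequence from the two-state model. The transition probabilities are the ones in the table; the emission probabilities and the choice to start in state B are assumptions.

```python
import random

# Transition probabilities from the slide's table.
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
# Hypothetical emission probabilities (uniform background, AT-rich island).
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}

def draw(dist, rng):
    """Sample one key of `dist` with probability given by its value."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def generate(length, rng, start="B"):
    """Sample (labels, sequence): emit a base from each state, then transition."""
    labels, bases = [], []
    state = start  # assumption: the path starts in background
    for _ in range(length):
        labels.append(state)
        bases.append(draw(EMIT[state], rng))
        state = draw(TRANS[state], rng)
    return "".join(labels), "".join(bases)

labels, seq = generate(60, random.Random(0))
print(labels)
print(seq)
```

With these transition probabilities, islands have a geometric length distribution with mean 1 / 0.25 = 4 positions, which is far shorter than real islands; realistic parameters would have much smaller switching probabilities.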
26
A Hidden Markov Model
  • Hidden states: $L = 1, \ldots, K$
  • Transition probabilities: $a_{ij}$ = transition probability from state $i$ to state $j$
  • Emission probabilities: $e_i(b) = P(\text{emitting } b \mid \text{state} = i)$
  • Initial state probability: $p(b) = P(\text{first state} = b)$
[Figure: State i and State j connected by transition probabilities, with emission probabilities $e_i(b)$ and $e_j(b)$]
27
What can we do with this model?
  • The model defines a joint probability over labels
    and sequences, P(L,S)
  • Implicit in model is what labels tend to go
    with what sequences (and vice versa)
  • Rules of probability allow us to use this model
    to analyze existing sequences

28
Fundamental HMM Operations
Computation and corresponding Biology
  • Decoding
  • Given an HMM and sequence S
  • Find a corresponding sequence of labels, L
  • Biology: annotate pathogenicity islands on a new sequence
  • Evaluation
  • Given an HMM and sequence S
  • Find $P(S \mid \text{HMM})$
  • Biology: score a particular sequence (not as useful for this model; will come back to this later)
  • Training
  • Given an HMM w/o parameters and set of sequences S
  • Find transition and emission probabilities that maximize $P(S \mid \text{params}, \text{HMM})$
  • Biology: learn a model for sequence composed of background DNA and pathogenicity islands
29
The "Hidden" in HMM
  • DNA does not come conveniently labeled (e.g. Island, Gene, Promoter)
  • We observe nucleotide sequences
  • The "hidden" in HMM refers to the fact that the state labels, L, are not observed
  • Only emissions are observed (e.g. the nucleotide sequence in our example)

State i
State j
A A G T T A G A G
30
Decoding With HMM
  • Given observables, we would like to predict a
    sequence of hidden states that is most likely to
    have generated that sequence

Pathogenicity Island Example
Given a nucleotide sequence, we want a labeling
of each nucleotide as either pathogenicity
island or background DNA
31
The Most Likely Path
  • Given a sequence, one reasonable choice for a
    labeling is

The sequence of labels, L (or path), that makes the labels and sequence most likely given the model: $L^* = \arg\max_L P(L, S \mid \text{HMM})$
32
Probability of a Path,Seq
[Figure: trellis with two rows of hidden labels L (P or B at each of 8 positions) above the observed sequence S = G C A A A T G C]
33
Probability of a Path,Seq
[Figure: trellis with two rows of hidden labels L (P or B at each of 8 positions) above the observed sequence S = G C A A A T G C]
We could try to calculate the probability of every path, but...
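A sketch of the joint probability of one path and the sequence, followed by the brute-force search over every path. The transition probabilities come from the generative model slide; the initial and emission probabilities are hypothetical. With 2 states and 8 positions there are already 2^8 = 256 paths, and the count doubles with every added position.

```python
import math
from itertools import product

INIT = {"B": 0.5, "P": 0.5}  # hypothetical initial state probabilities
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
# Hypothetical emission probabilities (uniform background, AT-rich island).
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}

def log_joint(labels, seq):
    """log P(labels, seq): initial state, then one transition and emission per step."""
    lp = math.log(INIT[labels[0]]) + math.log(EMIT[labels[0]][seq[0]])
    for prev, cur, base in zip(labels, labels[1:], seq[1:]):
        lp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][base])
    return lp

SEQ = "GCAAATGC"
all_paths = ["".join(p) for p in product("BP", repeat=len(SEQ))]
best = max(all_paths, key=lambda p: log_joint(p, SEQ))
print(len(all_paths), "candidate paths")
print("most likely path by brute force:", best)
```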
34
Decoding
  • Viterbi Algorithm
  • Finds most likely sequence of labels, L, given
    sequence and model
  • Uses dynamic programming (same technique used in
    sequence alignment)
  • Much more efficient than searching every path
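A minimal log-space sketch of the Viterbi algorithm for the two-state island model. The initial and emission probabilities are hypothetical, and the model here has no explicit end state. The cost is proportional to (sequence length) x (number of states)^2 rather than exponential in the sequence length.

```python
import math

STATES = "BP"
INIT = {"B": 0.5, "P": 0.5}  # hypothetical initial state probabilities
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
# Hypothetical emission probabilities (uniform background, AT-rich island).
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}

def viterbi(seq, states=STATES, init=INIT, trans=TRANS, emit=EMIT):
    """Return (most likely label string, its log joint probability with seq)."""
    # v[s]: log probability of the best path ending in state s at the current position.
    v = {s: math.log(init[s]) + math.log(emit[s][seq[0]]) for s in states}
    backpointers = []
    for base in seq[1:]:
        new_v, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda k: v[k] + math.log(trans[k][s]))
            pointers[s] = best_prev
            new_v[s] = (v[best_prev] + math.log(trans[best_prev][s])
                        + math.log(emit[s][base]))
        v = new_v
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), v[last]

best_path, best_logp = viterbi("GCAAATGC")
print(best_path, round(best_logp, 3))
```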

35
Probability of a Single Label
[Figure: the same trellis of labels L (P or B) over the sequence S = G C A A A T G C, with $P(\text{Label}_5 = B \mid S)$]
Forward algorithm (dynamic programming)
  • Calculate most probable label, $L_i$, at each position $i$
  • Doing this for all N positions gives us $L_1, L_2, L_3, \ldots, L_N$

36
Two Decoding Options
  • Viterbi Algorithm
  • Finds most likely sequence of labels, L, given
    sequence and model
  • Posterior Decoding
  • Finds most likely label at each position for all
    positions, given sequence and model
  • $L_1, L_2, L_3, \ldots, L_N$
  • Forward and Backward equations
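Posterior decoding can be sketched with the forward and backward recursions, combined as $P(L_i = s \mid S) = f_s(i)\, b_s(i) / P(S)$. The initial and emission probabilities are hypothetical, there is no explicit end state, and the unscaled probabilities used here would underflow on long sequences (real implementations rescale or work in log space).

```python
STATES = "BP"
INIT = {"B": 0.5, "P": 0.5}  # hypothetical initial state probabilities
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
# Hypothetical emission probabilities (uniform background, AT-rich island).
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}

def forward(seq):
    """f[i][s] = P(seq[0..i], state at i is s)."""
    f = [{s: INIT[s] * EMIT[s][seq[0]] for s in STATES}]
    for x in seq[1:]:
        f.append({j: EMIT[j][x] * sum(f[-1][k] * TRANS[k][j] for k in STATES)
                  for j in STATES})
    return f

def backward(seq):
    """b[i][s] = P(seq[i+1..] | state at i is s)."""
    b = [{s: 1.0 for s in STATES}]
    for x in reversed(seq[1:]):
        b.insert(0, {k: sum(TRANS[k][j] * EMIT[j][x] * b[0][j] for j in STATES)
                     for k in STATES})
    return b

def posterior_decode(seq):
    """Return (most probable label at each position, per-position posteriors)."""
    f, b = forward(seq), backward(seq)
    total = sum(f[-1].values())  # P(seq)
    post = [{s: f[i][s] * b[i][s] / total for s in STATES} for i in range(len(seq))]
    return "".join(max(STATES, key=p.get) for p in post), post

labels, post = posterior_decode("GCAAATGC")
print(labels)
print("P(Label_5 = B | S) =", round(post[4]["B"], 4))
```

Unlike the Viterbi path, the posterior labels are chosen position by position, so the resulting label string is not guaranteed to be a single high-probability path.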

37
Application: Bacillus subtilis
38
Method
Three State Model
  • States: Gene+, Gene-, AT Rich
  • Second order emissions: $P(S_i) = P(S_i \mid \text{State}, S_{i-1}, S_{i-2})$ (capturing trinucleotide frequencies)
  • Train using EM
  • Predict with posterior decoding
Nicolas et al (2002) NAR
39
Results
Gene on positive strand
Gene on negative strand
  • A/T Rich
  • Intergenic regions
  • Islands

Each line is $P(\text{label} \mid S, \text{model})$, color coded by label
Nicolas et al (2002) NAR
40
Fundamental HMM Operations
Computation and corresponding Biology
  • Decoding
  • Given an HMM and sequence S
  • Find a corresponding sequence of labels, L
  • Biology: annotate pathogenicity islands on a new sequence
  • Evaluation
  • Given an HMM and sequence S
  • Find $P(S \mid \text{HMM})$
  • Biology: score a particular sequence (not as useful for this model; will come back to this later)
  • Training
  • Given an HMM w/o parameters and set of sequences S
  • Find transition and emission probabilities that maximize $P(S \mid \text{params}, \text{HMM})$
  • Biology: learn a model for sequence composed of background DNA and pathogenicity islands
41
Training an HMM
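  • Transition probabilities, e.g. $P(P_{i+1} \mid B_i)$, the probability of entering a pathogenicity island from background DNA
  • Emission probabilities, i.e. the nucleotide frequencies for background DNA and pathogenicity islands
[Figure: two-state model with states B and P, transition probabilities $P(L_{i+1} \mid L_i)$ and emission probabilities $P(S \mid P)$, $P(S \mid B)$]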
42
Learning From Labelled Data
Maximum Likelihood Estimation
If we have a sequence that has islands marked, we
can simply count
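[Figure: trellis of labels L (P or B at each of 8 positions) over the sequence S = G C A A A T G C, with the known labels marked]
Emission probabilities $P(S \mid B)$, by counting: A 1/5, T 0, G 2/5, C 2/5 (and likewise for $P(S \mid P)$, etc.)
Transition probabilities $P(L_{i+1} \mid L_i)$, by counting:
          $B_{i+1}$   $P_{i+1}$   End
$B_i$     3/5         1/5         1/5
$P_i$     1/3         2/3         0
Start     1           0           0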
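The counting can be sketched as below. The labeling B B B P P P B B is an assumption: the marked path is not preserved in this transcript, but this labeling of G C A A A T G C reproduces every fraction shown on the slide.

```python
from collections import Counter
from fractions import Fraction

def estimate(labels, seq):
    """Maximum likelihood estimates of transitions and emissions by counting."""
    path = ["Start"] + list(labels) + ["End"]
    trans_counts = Counter(zip(path, path[1:]))  # (from, to) -> count
    emit_counts = Counter(zip(labels, seq))      # (state, base) -> count
    from_totals = Counter(path[:-1])             # transitions leaving each state
    state_totals = Counter(labels)               # emissions made by each state
    trans = {(a, b): Fraction(n, from_totals[a]) for (a, b), n in trans_counts.items()}
    emit = {(s, x): Fraction(n, state_totals[s]) for (s, x), n in emit_counts.items()}
    return trans, emit

LABELS = "BBBPPPBB"  # assumed labeling, consistent with the slide's counts
SEQ = "GCAAATGC"
trans, emit = estimate(LABELS, SEQ)
for key in sorted(trans):
    print("transition", key, trans[key])
for key in sorted(emit):
    print("emission", key, emit[key])
```

Events that never occur in the training data (such as B emitting T here) get probability 0 under plain counting; pseudocounts are a common remedy when training data is this small.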
43
Unlabelled Data
How do we know how to count?
[Figure: the same trellis from Start to End over the sequence S = G C A A A T G C, with no labels marked; the emission tables $P(S \mid P)$, $P(S \mid B)$ and the transition table $P(L_{i+1} \mid L_i)$ are all unknown (?)]
44
Unlabeled Data
[Figure: unlabeled trellis from Start to End, with labels L (P or B) over the sequence S = G C A A A T G C]
  • An idea
  • Imagine we start with some parameters
  • We could calculate the most likely path, P,
    given those parameters and S
  • We could then use P to update our parameters by
    maximum likelihood
  • And iterate (to convergence)


45
Expectation Maximization (EM)
  1. Initialize parameters
  2. E Step: Estimate probability of hidden labels, Q,
    given parameters and sequence
  3. M Step: Choose new parameters to maximize expected
    likelihood of parameters given Q
  4. Iterate

$P(S \mid \text{Model})$ is guaranteed not to decrease at each iteration
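For HMMs, EM takes the form of the Baum-Welch algorithm. The sketch below runs it on a single short sequence, assuming a two-state model with no explicit end state, hypothetical starting parameters, and unscaled probabilities (adequate only for short sequences). It records the likelihood before each update so the non-decreasing behaviour can be inspected.

```python
STATES = "BP"
INIT = {"B": 0.5, "P": 0.5}  # hypothetical starting parameters
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}
SEQ = ("TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTG"
       "GCAGACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCA"
       "TATTGGC")

def forward(seq, init, trans, emit):
    f = [{s: init[s] * emit[s][seq[0]] for s in STATES}]
    for x in seq[1:]:
        f.append({j: emit[j][x] * sum(f[-1][k] * trans[k][j] for k in STATES)
                  for j in STATES})
    return f

def backward(seq, trans, emit):
    b = [{s: 1.0 for s in STATES}]
    for x in reversed(seq[1:]):
        b.insert(0, {k: sum(trans[k][j] * emit[j][x] * b[0][j] for j in STATES)
                     for k in STATES})
    return b

def em_step(seq, init, trans, emit):
    """One EM iteration; returns new parameters and P(seq | old parameters)."""
    f, b = forward(seq, init, trans, emit), backward(seq, trans, emit)
    likelihood = sum(f[-1].values())
    n = len(seq)
    # E step: posterior over labels (gamma) and expected transition counts (xi).
    gamma = [{s: f[i][s] * b[i][s] / likelihood for s in STATES} for i in range(n)]
    xi = {k: {j: sum(f[i][k] * trans[k][j] * emit[j][seq[i + 1]] * b[i + 1][j]
                     for i in range(n - 1)) / likelihood
              for j in STATES} for k in STATES}
    # M step: re-estimate parameters from the expected counts.
    new_init = dict(gamma[0])
    new_trans = {k: {j: xi[k][j] / sum(xi[k].values()) for j in STATES}
                 for k in STATES}
    new_emit = {}
    for s in STATES:
        weight = sum(g[s] for g in gamma)
        new_emit[s] = {x: sum(g[s] for g, base in zip(gamma, seq) if base == x) / weight
                       for x in "ATGC"}
    return new_init, new_trans, new_emit, likelihood

params = (INIT, TRANS, EMIT)
likelihoods = []
for _ in range(10):
    *params, likelihood = em_step(SEQ, *params)
    likelihoods.append(likelihood)
print("first:", likelihoods[0], "last:", likelihoods[-1])
```

EM only finds a local maximum, so the result depends on the starting parameters; running from several initializations is a common precaution.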
46
Expectation Maximization (EM)
  • Remember the basic idea!
  • Use model to estimate (distribution of) missing
    data
  • Use estimate to update model
  • Repeat until convergence
  • EM is a general approach for learning models (ML
    estimation) when there is missing data
  • Widely used in computational biology
  • EM frequently used in motif discovery
  • Lecture 3

47
A More Sophisticated Application
Modeling Protein Families
  • Given amino acid sequences from a protein family,
    how can we find other members?
  • Can search databases with each known member, but
    this is not sensitive
  • More information is contained in full set
  • The HMM Profile Approach
  • Learn the statistical features of protein family
  • Model these features with an HMM
  • Search for new members by scoring with HMM

We will learn features from multiple alignments
48
Human Ubiquitin Conjugating Enzymes
  • UBE2D2   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
  • UBE2D3   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
  • BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
  • UBE2D1   FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
  • UBE2E1   FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
  • UBCH9    FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
  • UBE2N    LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
  • AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
  • UBCH10   FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
  • CDC34    FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
  • BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
  • UBE2G1   FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
  • UBE2B    FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
  • UBE2I    FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
  • E2EPF5   LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
  • UBE2L1   FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
  • UBE2L6   FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
  • UBE2H    LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
  • UBC12    VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

49
Profile HMM
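As a rough sketch of how a profile captures a family's features, the code below estimates position-specific emission probabilities from alignment columns and scores a candidate by log-odds. It is a simplification of a full profile HMM: it models match columns only (no insert or delete states), and the add-one pseudocount and uniform background are assumptions. The alignment is the first ten columns of five sequences from the slide 48 alignment.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# First ten columns of UBE2D2, UBE2E1, UBE2N, AAF67016 and UBCH10 (slide 48).
ALIGNMENT = ["FPTDYPFKPP", "FTPEYPFKPP", "LPEEYPMAAP", "IPERYPFEPP", "FPSGYPYNAP"]

def match_emissions(alignment, pseudocount=1.0):
    """Per-column amino acid probabilities, smoothed with a pseudocount."""
    columns = []
    for col in zip(*alignment):
        counts = Counter(aa for aa in col if aa != "-")  # ignore gap characters
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        columns.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return columns

def log_odds(seq, columns, background=1.0 / len(AMINO_ACIDS)):
    """Ungapped log-odds score of seq against the profile columns."""
    return sum(math.log(col[aa] / background) for aa, col in zip(seq, columns))

profile = match_emissions(ALIGNMENT)
for candidate in ("FPTDYPFKPP", "FSEEYPNKPP", "WWWWWWWWWW"):
    print(candidate, round(log_odds(candidate, profile), 3))
```

Conserved columns (such as the Y at position 5) contribute strongly to the score, while variable columns contribute little, which is the extra information a single-sequence search cannot use.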
50
Using Profile HMMs
Computation and corresponding Biology
  • Decoding
  • Find sequence of labels, L, that maximizes $P(L \mid S, \text{HMM})$
  • Biology: align a new sequence to a protein family
  • Evaluation
  • Find $P(S \mid \text{HMM})$
  • Biology: score a sequence for membership in family
  • Training
  • Find transition and emission probabilities that maximize $P(S \mid \text{params}, \text{HMM})$
  • Biology: discover and model family structure
51
Example Modeling Globins
  • Profile HMM from 300 randomly selected globin
    genes
  • Score database of 60,000 proteins

52
PFAM: Collection of Profile HMMs
http://www.sanger.ac.uk/Software/Pfam/
53
PFAM Resources
  • 8957 curated protein families and domains
  • Each with HMM profile(s)
  • Coverage
  • 73% of proteins in Swiss-Prot and SP-TrEMBL
  • 53% of typical genome sequence

54
Example PFAM Entry
  • Literature Links
  • Protein Structure
  • Domain Architectures
  • GO Functional Categories

55
HMMER
  • Implementation of Profile HMM methods
  • Given a multiple alignment, HMMER can build a
    Profile HMM
  • Given a Profile HMM (e.g. from PFAM), HMMER can
    score sequences for membership in the family or
    domain

56
HMMs in Context
  • HMMs
  • Sequence alignment
  • Gene Prediction
  • Generalized HMMs
  • Variable length states
  • Complex emissions models
  • e.g. Genscan
  • Bayesian Networks
  • General graphical model
  • Arbitrary graph structure
  • e.g. Regulatory network analysis

57
References
  • Sean R Eddy, Hidden Markov models, Current Opinion in Structural Biology, 6:361-365, 1996.
  • Sean R Eddy, Profile hidden Markov models, Bioinformatics, 14(9):755-763, 1998.
  • Anders Krogh, An introduction to hidden Markov models for biological sequences, in Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls and S. Kasif, pp. 45-63, Elsevier, 1998.
  • HMMER: profile HMMs for protein sequence analysis. http://hmmer.wustl.edu/
  • Erik L. L. Sonnhammer et al, Pfam: multiple sequence alignments and HMM-profiles of protein domains, Nucleic Acids Research, 26(1):320-322, 1998.
  • R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.

58
Tomorrow's Lab
  • Basic Sequence Analysis Tools
  • Argo Genome Browser
  • Blast
  • Gene prediction using Glimmer
  • Protein families with HMMER and PFAM
  • Comparative synteny analysis
  • Identify virulence factors by annotating and
    comparing virulent and avirulent bacterial
    sequences

60
The "Hidden" in HMM
  • DNA does not come conveniently labeled (e.g. Pathogenicity Island, Gene, Promoter)
  • All we observe are the nucleotide sequences
  • The "hidden" in HMM refers to the fact that the state labels, L, are not observed
  • Only emissions are observed (e.g. the nucleotide sequence in our example)

61
Relation between Viterbi and Forward
  • VITERBI
  • $V_j(i) = P(\text{most probable path ending in state } j \text{ with observation } i)$
  • Initialization
  • $V_0(0) = 1$
  • $V_k(0) = 0$, for all $k > 0$
  • Iteration
  • $V_j(i) = e_j(x_i) \max_k V_k(i-1)\, a_{kj}$
  • Termination
  • $P(x, \pi^*) = \max_k V_k(N)$
  • FORWARD
  • $f_l(i) = P(x_1 \ldots x_i, \text{state}_i = l)$
  • Initialization
  • $f_0(0) = 1$
  • $f_k(0) = 0$, for all $k > 0$
  • Iteration
  • $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
  • Termination
  • $P(x) = \sum_k f_k(N)\, a_{k0}$
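The two recursions differ only in whether predecessors are combined with max or with a sum, which the sketch below makes explicit by passing the combining function as an argument. As assumptions, an initial distribution replaces the begin state 0, there is no end transition $a_{k0}$, and the initial and emission probabilities are hypothetical.

```python
STATES = "BP"
INIT = {"B": 0.5, "P": 0.5}  # hypothetical; stands in for the begin state 0
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
# Hypothetical emission probabilities (uniform background, AT-rich island).
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}

def run_recursion(seq, combine):
    """Shared recursion: combine=max gives Viterbi, combine=sum gives Forward."""
    values = {s: INIT[s] * EMIT[s][seq[0]] for s in STATES}
    for x in seq[1:]:
        values = {j: EMIT[j][x] * combine(values[k] * TRANS[k][j] for k in STATES)
                  for j in STATES}
    return combine(values.values())

SEQ = "GCAAATGC"
viterbi_p = run_recursion(SEQ, max)  # probability of the single best path with SEQ
forward_p = run_recursion(SEQ, sum)  # probability of SEQ summed over all paths
print("P(x, best path) =", viterbi_p)
print("P(x)            =", forward_p)
```

Because the sum includes the best path along with every other path, the Viterbi value can never exceed the Forward value.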

Slide Credit: Serafim Batzoglou