CS5263 Bioinformatics - PowerPoint PPT Presentation

1 / 50
About This Presentation
Title:

CS5263 Bioinformatics

Description:

Mj. Dealing with Deletions. This would allow an arbitrary length of deletion ... Mj. E. Dj. Still allows any length of deletions. Fewer parameters. Less ... – PowerPoint PPT presentation

Number of Views:44
Avg rating:3.0/5.0
Slides: 51
Provided by: jianhu
Learn more at: http://www.cs.utsa.edu
Category:

less

Transcript and Presenter's Notes

Title: CS5263 Bioinformatics


1
CS5263 Bioinformatics
  • Lecture 14 Hidden Markov Models and applications

2
  • Last lecture
  • Some practical issues in HMM modeling
  • HMMs for pairwise sequence alignment
  • Today
  • HMM for multiple sequence alignment
  • HMM for gene prediction

3
Duration models
p
P(L)
s
1-p
L
p
s
s
s
s
1-p
Min, then geometric
p
p
p
p
s
s
s
s
1-p
1-p
1-p
1-p
Negative binominal
4
Explicit duration modeling
Intron
Exon
Intergenic
P(A I) 0.3 P(C I) 0.2 P(G I) 0.2 P(T
I) 0.3
Generalized HMM. Often used in gene finders
P
L
Empirical intron length distribution
5
Silent states
  • Silent states are states that do not emit symbols
    (e.g., the state 0 in our previous examples)
  • Silent states can be introduced in HMMs to reduce
    the number of transitions

Dj
Silent state
B
E
B
Mj
E
6
FSA for global alignment
e
Xi aligned to a gap
?
d
Xi and Yj aligned
?
d
Yj aligned to a gap
e
?
7
Pair-HMM for global alignment
?
Xi aligned to a gap
1- ?
q(xi) 4 emission probabilities
?
Xi and Yj aligned
1-2?
q(yj) 4 emission probabilities
?
Yj aligned to a gap
P(xi,yj)
?
16 emission probabilities
1-?
Pair-wise HMM emit two sequences
simultaneously Algorithm is similar to regular
HMM, but need an additional dimension. e.g. in
Viterbi, we need Vk(i, j) instead of Vk(i)
8
Pair-HMM and FSA for alignment
  • After proper transformation between the
    probabilities and substitution scores, the two
    are identical
  • ?(a, b) ? log p(a, b) / (q(a) q(b))
  • d ? log ?
  • e ? log ?
  • Details in Durbin book chap 4
  • Finding an optimal FSA alignment is equivalent to
    finding the most probable path with Viterbi
  • Has more use in comparative gene finding than in
    seq alignment

9
  • HMM for multiple sequence alignment

10
Question
  • How to represent a multiple alignment so that
    comparing a single sequence with an alignment can
    be done efficiently?

11
Solution 1
  • Use regular expression to represent the consensus
    sequences
  • Implemented in the ProSite database
  • for example
  • C - x(2) - P - F - x - FYWIV - x(7) - C -
    x(8,10) - W - C - x(4) - DNSR - FYW - x(3,5)
    - FYW - x - FYWI - C

12
Multi-alignments represented by regular
expression
  • Intuitive
  • Flexible
  • Efficient matching algorithms
  • Mismatches allowed possibly
  • Problems
  • Limited power in expressing more divergent
    positions
  • Specify boundaries of indels not so easy
  • May have to change the regular expression when a
    new member of a protein family is identified

13
Solution 2
  • For a non-gapped alignment, we can create a
    position-specific weight matrix (PWM)
  • PSWM
  • PSSM
  • Also called a profile
  • May use pseudocounts

14
Scoring by PWMs
x KCIDNTHIKR
P(x M) ?i ei(xi) Random model each amino
acid xi can be generated with probability
q(xi) P(x random) ?i q(xi) Log-odds ratio
S log P(XM) / P(Xrandom) ?i log
ei(xi) / q(xi)
15
PWMs
  • Advantage
  • Can be used to identify both strong and weak
    homologies
  • Easy to implement and use
  • Probabilistic interpretation
  • Problem
  • Not intuitive to deal with gaps

16
PWMs are HMMs
B
E
M1
Mk
Transition probability 1
20 emission probabilities for each state
  • This can only be used to search for sequences
    without insertion / deletions (indels)
  • We can add additional states for indels!

17
Dealing with insertions
Ij
B
E
M1
Mj
Mk
  • This would allow an arbitrary number of
    insertions after the j-th position
  • i.e. the sequence being compared can be longer
    than the PWM

18
Dealing with insertions
I1
Ij
Ik
B
E
M1
Mj
Mk
  • This allows insertions at any position

19
Dealing with Deletions
B
E
Mi
Mj
  • This would allow a subsequence i, j being
    deleted
  • i.e. the sequence being compared can be shorter
    than the PWM

20
Dealing with Deletions
B
E
  • This would allow an arbitrary length of deletion
  • Completely forward connected
  • Too many transitions

21
Deletion with silent states
Dj
Silent state
B
Mj
E
  • Still allows any length of deletions
  • Fewer parameters
  • Less detailed control

22
Full model
  • Called profile-HMM

Dj
D deletion state I insertion state M matching
state
Ij
B
Mj
E
Algorithm basically the same as a regular HMM
23
Using profile HMM
  • Alignment
  • Align a sequence to a profile HMM
  • Viterbi
  • Searching
  • Given a sequence and HMMs for different protein
    families, which family the sequence may belong
    to?
  • Given a HMM for a protein family and many
    proteins, which protein may belong to the family?
  • Viterbi or forward
  • How to score?
  • Training / Learning
  • Given a multiple alignment, how to construct a
    HMM?
  • Given an unaligned protein family, how to
    construct a HMM?

24
Pfam
  • A database of protein families
  • Developed by Sean Eddy and colleagues while
    working in Durbins lab
  • Hand-curated seed multiple alignment
  • Train HMM from seed alignment
  • Hand-chosen score thresholds
  • Automatic classification / classification of all
    other protein sequences
  • 7973 families in Rfam 18.0, 8/2005
  • (covers 75 of proteins)

25
Modeling building from aligned sequences
  • Match for columns with no gaps
  • When gaps exist, how to decide whether they are
    insertions or matches?
  • Heuristic rule gt50 gaps, insertion, otherwise,
    match
  • How to add pseudocount?
  • Simply add one
  • According to background distribution
  • Use a mixture of priors (Dirchlet mixtures)
  • Sequence weighting
  • Very similar sequences should each get less weight

26
Modeling building from unaligned sequences
  • Choose a model length and initial parameters
  • Commonly use average seq length as model length
  • Baum-Welch or Viterbi training
  • Usually necessary to use multiple starting points
    or other heuristics to escape from local optima
  • Align all sequences to the final model using
    Viterbi
  • Why not F-B?

27
Align a seq to profile HMMs
Viterbi
28
Searching
Protein Database
  • Scoring
  • Log likelihood Log P(X M)
  • Log odds Log P(X M) / P(X random)
  • Length-normalization
  • Is the hit real?
  • How does the score compare with that of the
    sequences already in the family?
  • How does the score compare with that of random
    sequences?


Score for each protein
29
Example modeling and searching for globins
  • 300 random picked globin sequence
  • Build a profile HMM from scratch (without
    pre-alignment)
  • Align 60,000 proteins to the HMM
  • Majority are non-globin

30
  • Even after length normalization, LL is still
    length-dependent
  • Log-odd score provides better separation
  • Takes amino acid composition into account
  • Real globins could have scores less than 0

31
  • Estimate mean score and standard deviation for
    non-globin sequences for each length
  • Z-score (raw score mean) / (standard
    deviation)
  • Z-score is length-invariant
  • Real globins have positive scores

32
  • Gene prediction

33
Gene structure
intron1
intron2
exon2
exon3
Intergenic
exon1
5
3
transcription
splicing
translation
Exon coding Intron non-coding Intergenic
non-coding
34
Finding genes
Human
Fugu
worm
E.coli
As the coding/non-coding length ratio decreases,
exon prediction becomes more complex
35
Gene prediction in prokaryote
  • Finding long ORFs (open reading frame)
  • An ORF may not contain stop codons
  • Average ORF length 64/3
  • Expect 300bp ORF per 36kbp
  • Actual ORF length 1000bp
  • Codon biases
  • Some triplets are used more frequently than
    others
  • Codon third position biases

36
HMM for eukaryote gene finding
  • Most basic idea is the same the distributions of
    nucleotides is different in exon and other
    regions
  • Alone wont work very well
  • More signals are needed
  • How to combine all the signal together?

37
Simplest model
intron
Intergenic
exon
4 emission probability
64 triplets emission probabilities
4 emission probability
Actually more accurate at the di-amino-acid
level, i.e. 2 codons. Many methods use 5th-order
Markov model for all regions
  • Exon length may not be exact multiple of 3
  • Basically have to triple the number of states to
    remember the excess number of bases in the
    previous state

38
More detailed model
Single exon
Init exon
intron
Intergenic
Term exon
Internal Exon
39
Sub-models
CDS coding sequence
5-UTR
START
CDS
Init exon
3-UTR
STOP
CDS
Term exon
  • START, STOP are PWMs
  • Including start and stop codons and surrounding
    bases

40
Sub-model for intron
Splice donor
Intron body
Splice acceptor
Intron
  • Sequence logos an informative display of PWMs
  • Within each column, relative height represents
    probability
  • Height of each column reflects information
    content

41
(No Transcript)
42
Explicit duration modeling
x1 x2 x3 ..xN
1
2
Pk(i)
K
  • For each position j and each state i
  • Need to consider the transition from all previous
    positions
  • Time O(N2K2)
  • N can be 108

43
Speedup GHMM
  • Restrict maximum duration length to be L
  • O(LNK2)
  • However, intergenic and intron can be quite long
  • L can be 105
  • Compromise explicit duration for exons only,
    geometric for all other states
  • Pre-compute all possible starting points of ORFs
  • For init exon ATG
  • For internal/terminal exon splice donor signal
    (GT)

44
GeneScan model
45
Approaches to gene finding
  • Homology
  • BLAST, BLAT, etc.
  • Ab initio
  • Genscan, Glimmer, Fgenesh, GeneMark, etc.
  • Each one has been tuned towards certain organisms
  • Hybrids
  • Twinscan, SLAM
  • Use pair-HMM, or pre-compute score for potential
    coding regions based on alignment
  • None are perfect, never used alone in practice

46
Current status
  • More accurate on internal exons
  • Determining boundaries of init and term exons is
    hard
  • Biased towards multiple-exon genes
  • Alternative splicing is hard
  • Non-coding RNA is hard

47
  • State of the Art
  • predictions 60 similar to real proteins
  • 80 if database similarity used
  • lab verification still needed, still expensive

48
HMM wrap up
  • Weve talked about
  • Probability, mainly Bayes Theorem
  • Markov models
  • Hidden Markov models
  • HMM parameter estimation given state path
  • Decoding given HMM and parameters
  • Viterbi
  • F-B
  • Learning
  • Baum-Welch (Expectation-Maximization)
  • Viterbi

49
HMM wrap up
  • Weve also talked about
  • Extension to gHMMs
  • HMM for multiple sequence alignment
  • gHMM for gene finding
  • We did not talk about
  • Higher-order Markov models
  • How to escape from local optima in learning

50
Next lectures
  • String pattern matching algorithms
  • Suffix tree and its applications
  • Motif finding
  • Biology
  • Algorithm
  • Combinatorial
  • Probabilistic
Write a Comment
User Comments (0)
About PowerShow.com