Susie Jo

Transcript and Presenter's Notes

1
Protein Family Classification Using AI Techniques
(Profile-HMMs, SVM)
2003.12.04, Susie Jo, Bio-Information System Laboratory, BioSystem Dept., KAIST
2
Table of Contents
Intro
Profile HMM
SVM-Fisher
3
Protein homology detection
CS774

Proteins and DNA can be encoded as primary sequences
(amino acid residues: 20 types; nucleotides: A/G/C/T)
Introduction
Profile HMM
SVM-Fisher
SOM
[Figure: a functionally annotated sequence is compared with a functionally unknown sequence by sequence similarity; a similar sequence suggests a similar function]
4
Protein Classification: SCOP (Structural Classification of Proteins database)

SCOP hierarchy of protein domains

Primary Level
Varying degrees of similarity
5
GPCR

The three major subfamilies:
Receptors related to the light receptor rhodopsin and the β2-adrenergic receptor (family A)
- can be subdivided into six major subgroups
- overall homology among all family A receptors is low
- a few key residues are highly conserved, e.g. the Asp-Arg-Tyr (DRY) motif
Receptors related to the glucagon receptor (family B)
Receptors related to the metabotropic neurotransmitter receptors (family C)
Other families:
Yeast pheromone receptors (families D, E)
cAMP receptors (family F)

6
GPCR 3 major subfamilies
Low Sequence Homology
7
Protein remote homology detection Methods

1. Pairwise similarity between proteins
Simple sequence similarity
Using Smith-Waterman dynamic programming
BLAST, FASTA
(+) Simple, easy
(-) Low accuracy

Example: the urotensin receptor is very similar in sequence to the type 4 somatostatin receptor, yet the two have different ligands
8
Protein remote homology detection Methods

2. Profiles and hidden Markov models (HMMs)
Profile-based methods iteratively collect homologous sequences from a large database
and incorporate the resulting statistics into a single model
e.g. PSI-BLAST and SAM-T98
3. SVM-Fisher method (Jaakkola et al., 1999, 2000)
couples an iterative HMM training scheme with an SVM

10
Motivation

Objective: given a family of related sequences,
what is an effective way to capture what they
have in common, so that we can recognize other
members of the family?
Some standard methods for characterization
- Multiple alignments
- Profile
- Regular Expressions
- Consensus Sequences
- Hidden Markov Models

11
A. Gaulton, T. K. Attwood, Bioinformatics approach for the classification of GPCR, Current Opinion in Pharmacology 2003, 3:114-120
Using Family Profile

Use a multiple sequence alignment (MSA) of the family
Identify the most highly conserved regions

Example alignment:
RWDAGCVN
RWDSGCVN
RWHHGCVQ
RWKGACYN
RWLWACEQ
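
As a rough illustration of "identify the most highly conserved regions", the sketch below computes per-column residue frequencies for the five sequences on this slide and lists the fully conserved columns. The function names and the conservation threshold are assumptions made for this example, not part of the presentation.

```python
from collections import Counter

# The five aligned sequences shown on the slide.
alignment = ["RWDAGCVN", "RWDSGCVN", "RWHHGCVQ", "RWKGACYN", "RWLWACEQ"]

def column_profile(seqs):
    """Return, for each alignment column, a dict residue -> relative frequency."""
    n = len(seqs)
    return [
        {res: count / n for res, count in Counter(col).items()}
        for col in zip(*seqs)
    ]

def conserved_columns(profile, threshold=1.0):
    """0-based indices of columns whose most frequent residue reaches `threshold`."""
    return [i for i, freqs in enumerate(profile) if max(freqs.values()) >= threshold]

profile = column_profile(alignment)
fully_conserved = conserved_columns(profile)        # columns with a single residue
mostly_conserved = conserved_columns(profile, 0.6)  # hypothetical looser threshold
```

Lowering the threshold trades specificity for coverage; a real profile would also weight sequences and use substitution information.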
12
Method of characterizing family of nucleotide
sequences

1. Regular expression
[AT] [CG] [AC] [ACTG]* A [TG] [GC]
But it cannot distinguish between
the highly implausible T G C T - - A G G
and the consensus A C A C - - A T C
2. Consensus sequence
A C A C - - A T C
Unclear what "consensus" means
Need some kind of similarity table between
nucleotides to measure the probability of a
sequence
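
A small check of the regular-expression limitation, assuming the slide's pattern is the character-class expression `[AT][CG][AC][ACTG]*A[TG][GC]` (the brackets and the star are not visible in the transcript and are an assumption here). Both the implausible and the consensus sequence match, so the expression gives no way to rank them.

```python
import re

# Assumed reconstruction of the slide's regular expression.
pattern = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")

# Gap characters are removed before matching, since a regex sees raw sequences.
implausible = "TGCT--AGG".replace("-", "")
consensus = "ACAC--ATC".replace("-", "")

# A regular expression only answers yes/no; it assigns no score or probability.
matches = {seq: bool(pattern.fullmatch(seq)) for seq in (implausible, consensus)}
```

This yes/no behaviour is what motivates a probabilistic model that can score how typical a sequence is.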

13
Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755-763
HMM

A model that generates sequences
A symbol sequence (the observations) is generated
by moving between states.
The state sequence is hidden.
- States
- Symbol emission probabilities
- State transition probabilities

Hidden state sequence, $S$
Observed symbol sequence, $X$
$P(X, S \mid \mathrm{HMM})$
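
To make the joint probability of an observed symbol sequence X and a hidden state sequence S concrete, here is a minimal sketch for a discrete HMM. The two-state, two-symbol parameters are invented for illustration; states and symbols are represented by integer indices.

```python
def joint_probability(symbols, states, pi, A, B):
    """P(X, S | HMM) for one symbol sequence and one state path.

    pi[i]   : probability of starting in state i
    A[i][j] : probability of moving from state i to state j
    B[i][k] : probability that state i emits symbol k
    """
    p = pi[states[0]] * B[states[0]][symbols[0]]
    for t in range(1, len(symbols)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][symbols[t]]
    return p

# Hypothetical toy model (not taken from the presentation).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

p_joint = joint_probability([0, 1, 1], [0, 1, 1], pi, A, B)
```

In practice the state path is unknown, which is why the evaluation and decoding problems later in the talk sum or maximize over all paths.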
14
3. HMM (e.g. gene sequence)
[Figure: HMM with an insertion state; state path M M M I M M M]
15
Deriving HMM and Scoring HMM

Deriving the HMM from a known alignment
Each column in the alignment generates a state
Count the occurrences of A, T, G, C in each column to
determine the emission probabilities for each state
Transition probabilities to insertion states are estimated
in a similar way (with some caution)
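
The counting procedure can be sketched as follows. The toy alignment, the rule "a column with more than half gaps is an insert column", and all function names are assumptions for illustration; the slides do not specify them.

```python
from collections import Counter

# Hypothetical toy nucleotide alignment ('-' marks a gap).
toy_alignment = ["ACA--ATG", "TCAACATC", "ACA--AGC", "AGA--ATC", "ACCG-ATC"]

def match_columns(seqs, max_gap_fraction=0.5):
    """Columns treated as match states: those with at most `max_gap_fraction` gaps."""
    return [i for i, col in enumerate(zip(*seqs))
            if col.count("-") / len(col) <= max_gap_fraction]

def emission_probabilities(seqs, columns, alphabet="ACGT"):
    """Relative frequency of each symbol in each match column (gaps ignored)."""
    cols = list(zip(*seqs))
    table = []
    for i in columns:
        counts = Counter(c for c in cols[i] if c != "-")
        total = sum(counts.values())
        table.append({a: counts[a] / total for a in alphabet})
    return table

match_cols = match_columns(toy_alignment)
emissions = emission_probabilities(toy_alignment, match_cols)

# Transition into the insert region, estimated the same way: the fraction of
# sequences with at least one residue in the insert columns. This treats all
# insert columns as a single region, which holds for this toy alignment.
insert_cols = [i for i in range(len(toy_alignment[0])) if i not in match_cols]
p_enter_insert = sum(
    any(seq[i] != "-" for i in insert_cols) for seq in toy_alignment
) / len(toy_alignment)
```

Unobserved symbols get probability zero here, which is the "caution" the slide mentions; pseudocounts are the usual remedy.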

16
Probability and Log-odds

Probability: depends on the sequence length (L)
Penalizes insertions, favors deletions
Log-odds: computed relative to a null model
The null model treats the sequence of nucleotides as
random
A better estimate uses the overall frequency of
nucleotides (or amino acids) in the organism's genome
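
A minimal sketch of the difference between a raw probability and a log-odds score, assuming a gap-free model with one emission table per position and a null model that emits each symbol independently. All numbers are hypothetical.

```python
import math

def log_probability(seq, emissions):
    """Log-probability of a sequence under per-position emission tables."""
    return sum(math.log(emissions[i][ch]) for i, ch in enumerate(seq))

def log_odds(seq, emissions, background):
    """Log-odds of the model against an independent-symbol null model."""
    null = sum(math.log(background[ch]) for ch in seq)
    return log_probability(seq, emissions) - null

# Null model: uniform nucleotide frequencies (genome-wide frequencies could be used instead).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical two-position model, plus a third uninformative position.
short_model = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
long_model = short_model + [uniform]

# The raw log-probability drops as the sequence gets longer ...
lp_short = log_probability("AC", short_model)
lp_long = log_probability("ACG", long_model)
# ... while the log-odds is unchanged by an uninformative extra position.
lo_short = log_odds("AC", short_model, uniform)
lo_long = log_odds("ACG", long_model, uniform)
```

Replacing `uniform` with measured background frequencies corresponds to the "better estimate" mentioned on the slide.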

17
Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755-763
Profile HMM

HMM architecture for representing profiles of
multiple sequence alignments
Linear left-right model
Match state
Insert state
Delete state

Example alignment (columns 1 2 3):
C A F
C G W
C D Y
C V F
C K Y
18
Visual Recognition Tutorial. Thad Starner, Alex Pentland, Visual Recognition of American Sign Language Using HMMs, in International Workshop on Automatic Face and Gesture Recognition, pages 189-194, 1995
Elements of Profile HMMs

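$N$: the number of hidden states
$Q$: the set of states, $Q = \{1, 2, \dots, N\}$
$M$: the number of symbols
$V$: the set of symbols, $V = \{1, 2, \dots, M\}$
$A$: the state-transition probability matrix, $a_{ij} = P(q_{t+1} = j \mid q_t = i)$
$B$: the observation probability distribution, $b_j(k) = P(o_t = k \mid q_t = j)$
$\pi$: the initial state distribution, $\pi_i = P(q_1 = i)$
$\lambda = (A, B, \pi)$: the entire model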

19
Three Basic Problems

EVALUATION: given an observation $O = (o_1, o_2, \dots, o_T)$
and a model $\lambda$, efficiently compute $P(O \mid \lambda)$
Hidden states complicate the evaluation
Given two models $\lambda_1$ and $\lambda_2$, this can be used to
choose the better one.
DECODING: given an observation $O = (o_1, o_2, \dots, o_T)$
and a model $\lambda$, find the optimal state sequence
$q = (q_1, q_2, \dots, q_T)$.
Optimality criterion has to be decided (e.g.
maximum likelihood)
Explanation of the data.
LEARNING: given $O = (o_1, o_2, \dots, o_T)$, estimate the
model parameters $\lambda$ that maximize $P(O \mid \lambda)$

20
Solution to problem 1
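1. Forward Algorithm

Define the forward variable as $\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = i \mid \lambda)$
$\alpha_t(i)$ is the probability of observing the
partial sequence $(o_1, o_2, \dots, o_t)$
such that the state $q_t$ is $i$.
Here $a_{ij}$ are the state-transition probabilities (A), $b_j(o)$ the observation probabilities (B), and $\pi_i$ the initial state distribution.
Induction
1. Initialization: $\alpha_1(i) = \pi_i \, b_i(o_1)$
2. Induction: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(o_{t+1})$
3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$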
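
The three steps (initialization, induction, termination) can be sketched for a discrete HMM as below. The toy parameters are hypothetical, observations are integer symbol indices, and no scaling is applied, so this version is only suitable for short sequences.

```python
def forward(obs, pi, A, B):
    """Forward algorithm: returns (alpha, P(O | model)).

    alpha[t][i] is the probability of the first t+1 observations
    with the HMM in state i at that time step.
    """
    n = len(pi)
    # Initialization
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # Induction
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ])
    # Termination
    return alpha, sum(alpha[-1])

# Hypothetical toy model: 2 states, 2 symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

alpha, likelihood = forward([0, 1], pi, A, B)
```

The cost is O(N^2 T) rather than the O(N^T) of enumerating every state path.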

21
Solution to problem 1
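2. Backward Algorithm

Define the backward variable as $\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = i, \lambda)$
$\beta_t(i)$ is the probability of observing the
partial sequence $(o_{t+1}, \dots, o_T)$
given that the state $q_t$ is $i$.
Induction
1. Initialization: $\beta_T(i) = 1$
2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$, for $t = T-1, \dots, 1$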

22
Solution to problem 2
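Choose the most likely path
Find the path $(q_1, q_2, \dots, q_T)$ that maximizes the
likelihood $P(q_1, q_2, \dots, q_T \mid O, \lambda)$
Solution by Dynamic Programming
Define $\delta_t(i) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = i, o_1, \dots, o_t \mid \lambda)$
$\delta_t(i)$ is the probability of the most likely path ending in state $i$
By induction we have $\delta_{t+1}(j) = \left[ \max_i \delta_t(i) \, a_{ij} \right] b_j(o_{t+1})$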

23
Solution to problem 2
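Viterbi Algorithm

Initialization: $\delta_1(i) = \pi_i \, b_i(o_1)$, $\psi_1(i) = 0$
Recursion: $\delta_t(j) = \max_i \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)$, $\psi_t(j) = \arg\max_i \left[ \delta_{t-1}(i) \, a_{ij} \right]$
Termination: $P^* = \max_i \delta_T(i)$, $q_T^* = \arg\max_i \delta_T(i)$
Path (state sequence) backtracking: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, for $t = T-1, \dots, 1$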
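
The four Viterbi steps can be sketched as follows for a discrete HMM. The toy parameters are hypothetical, and probabilities are multiplied directly (a practical implementation would work in log space to avoid underflow).

```python
def viterbi(obs, pi, A, B):
    """Return (most likely state path, probability of that path)."""
    n = len(pi)
    # Initialization
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    # Recursion: keep, for each state, the best predecessor and its score
    for o in obs[1:]:
        best_prev = [max(range(n), key=lambda i: delta[i] * A[i][j])
                     for j in range(n)]
        delta = [delta[best_prev[j]] * A[best_prev[j]][j] * B[j][o]
                 for j in range(n)]
        backpointers.append(best_prev)
    # Termination
    last = max(range(n), key=lambda i: delta[i])
    # Path backtracking
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical toy model: 2 states, 2 symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

best_path, best_prob = viterbi([0, 1, 1], pi, A, B)
```

For a profile HMM, the returned path is the alignment of the sequence to the model's match, insert, and delete states.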

24
Solution to problem 3
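Baum-Welch Algorithm

Estimate $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$
No analytic method because of complexity:
iterative solution.
Baum-Welch Algorithm (actually an EM algorithm)
1. Let the initial model be $\lambda_0$
2. Compute a new $\lambda$ based on $\lambda_0$ and the observation $O$.
3. If $\log P(O \mid \lambda) - \log P(O \mid \lambda_0) < \Delta$ (a small threshold), stop
4. Else set $\lambda_0 \leftarrow \lambda$ and go to step 2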
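
The iteration can be sketched for a single observation sequence and a discrete HMM as below. This is an unscaled illustration under invented toy parameters; the tolerance `tol`, the iteration cap, and all function names are assumptions, and a state that is never visited would cause a division by zero.

```python
import math

def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        beta.insert(0, [sum(A[i][j] * B[j][o] * beta[0][j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One EM update; also returns log P(O | current model)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    lik = sum(alpha[-1])
    # gamma[t][i] = P(q_t = i | O); xi[t][i][j] = P(q_t = i, q_{t+1} = j | O)
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / lik
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(n)] for i in range(n)]
    new_B = [[sum(g[i] for g, o in zip(gamma, obs) if o == k) / sum(g[i] for g in gamma)
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, math.log(lik)

def baum_welch(obs, pi, A, B, tol=1e-6, max_iter=100):
    """Repeat EM updates until the log-likelihood gain falls below `tol`."""
    history, prev = [], None
    for _ in range(max_iter):
        pi, A, B, loglik = baum_welch_step(obs, pi, A, B)
        history.append(loglik)
        if prev is not None and loglik - prev < tol:
            break
        prev = loglik
    return pi, A, B, history

# Hypothetical observation sequence and starting model.
obs = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
pi_hat, A_hat, B_hat, history = baum_welch(
    obs, [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.5, 0.5], [0.1, 0.9]])
```

Each entry of `history` is the log-likelihood of the model before that update, so the list should never decrease; EM only guarantees a local optimum, which is why the starting model matters.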

25
Preventing Overfitting

Pseudocount (fake count)
It is dangerous to estimate a probability distribution
from just a few examples
Pretend you saw an amino acid in a position even
though it wasn't there
Sequence Weighting
Some sequences are more frequent than others
Get more Data!
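
A minimal sketch of the pseudocount idea, using the simplest add-one (Laplace) scheme. The slides do not say which pseudocount scheme was used, so the scheme, the example column, and the function name are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(column, pseudocount=1.0, alphabet=AMINO_ACIDS):
    """Emission probabilities for one alignment column with added pseudocounts."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Hypothetical column observed in five sequences.
column = "DDHKL"

raw = column_probabilities(column, pseudocount=0.0)       # unseen residues get 0
smoothed = column_probabilities(column, pseudocount=1.0)  # unseen residues get 1/25
```

With only five observations the pseudocounts dominate; more data, or pseudocounts scaled by background frequencies, reduce that effect.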

26
Pseudocount
27
SAM-T98 (software tool)
28
SVM-Fisher
29

Jaakkola et al., A Discriminative Framework for Detecting Remote Protein Homologies, Journal of Computational Biology
Discriminative Framework for Detecting Remote
Protein Homologies

A variant of support vector machines (SVMs) using a new
kernel function
Kernel function:
derived from a generative statistical model for
a protein family, in this case an HMM
Use generative statistical models built from
multiple sequences, in this case HMMs, as a way
of extracting features from protein sequences.
This maps all protein sequences to points in a
Euclidean feature space of fixed dimension.

30
Method

$X = [x_1, \dots, x_n]$: protein sequence, where each $x_i$ is an amino
acid residue
$H_1$: HMM estimated for a particular protein family
$P(X \mid H_1)$: the corresponding probability model
A likelihood ratio score is used in place of the
simple probability $P(X \mid H_1)$

31
Method

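1. Discriminative approaches
By Bayes' rule, with $H_0$ denoting the null model,
$P(H_1 \mid X) = \frac{P(X \mid H_1) P(H_1)}{P(X \mid H_1) P(H_1) + P(X \mid H_0) P(H_0)}$
$P(H_1 \mid X)$: posterior probability of the model,
i.e. the posterior probability that the sequence X
belongs to the protein family being modeled.
Score function $L(X)$: the log posterior odds score,
$L(X) = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)} = \log \frac{P(X \mid H_1)}{P(X \mid H_0)} + \log \frac{P(H_1)}{P(H_0)}$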
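
As a small worked example of the log posterior odds score, the sketch below assumes only two hypotheses, the family model H1 and a null model H0 (the null model and the prior value are assumptions for illustration). Under that assumption the posterior follows from the score by a logistic function.

```python
import math

def log_posterior_odds(log_p_x_h1, log_p_x_h0, prior_h1=0.5):
    """L(X) = log-likelihood ratio plus log prior odds (two-hypothesis case)."""
    return (log_p_x_h1 - log_p_x_h0) + math.log(prior_h1 / (1.0 - prior_h1))

def posterior_h1(log_p_x_h1, log_p_x_h0, prior_h1=0.5):
    """P(H1 | X), obtained from L(X) as 1 / (1 + exp(-L(X)))."""
    score = log_posterior_odds(log_p_x_h1, log_p_x_h0, prior_h1)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical likelihoods: the family model explains X twice as well as the null model.
score = log_posterior_odds(math.log(0.02), math.log(0.01))
posterior = posterior_h1(math.log(0.02), math.log(0.01))
```

A positive score means the sequence is more likely a family member than not; the prior shifts the decision threshold.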

32
Method

2. Kernel methods
$K(X_i, X)$: kernel function,
a measure of pairwise similarity between the
training example $X_i$ and the new example $X$
The sign of the discriminant function $L(X)$ determines
the predicted class for any sequence $X$

33
Method

3. The Fisher kernel
Fisher score: $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$,
the gradient with respect to the parameters of the HMM
$P(X \mid H_1, \theta)$: the corresponding probability model,
an HMM estimated for a particular
family of proteins
$\theta$: includes the output and the transition
probabilities of an HMM trained to model the family

34
Method

Fisher score
the gradients w.r.t the parameters of the HMM
Probability value of HMM for each sequence

35
Method

Fisher score
Derivatives of $\log P(X \mid H_1, \theta)$ with respect to the emission
probabilities


36
Method

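Fisher score vector relative to the
emission probabilities:
a vector whose components are indexed by (x, s), with
values given by the derivative of $\log P(X \mid H_1, \theta)$ with respect to the emission probability of residue x in state s
Dimension of the Fisher score vector: 20 × m (m = number of
states)
- Based on the expected posterior frequency of visiting state s
and generating residue x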

37
Method

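A natural (squared) distance between the gradient
vectors: $D^2(X, X') = \frac{1}{2\sigma^2} \lVert U_X - U_{X'} \rVert^2$
It quantifies the similarity between two fixed-length
gradient vectors $U_X$ and $U_{X'}$ corresponding to two
sequences $X$ and $X'$
Gaussian kernel: $K(X, X') = e^{-D^2(X, X')}$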
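
A minimal sketch of a Gaussian kernel on two fixed-length gradient vectors. The vectors and the width `sigma` are hypothetical; in practice the vectors would be Fisher scores computed from the trained HMM.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """exp(-||u - v||^2 / (2 * sigma^2)) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Hypothetical 2-dimensional score vectors.
k_same = gaussian_kernel([0.3, -1.2], [0.3, -1.2])      # identical vectors
k_far = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0)
```

The kernel equals 1 for identical vectors and decays towards 0 as the vectors move apart; `sigma` sets how quickly.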

38
Method

Method Summary (SVM-Fisher Method)
1. Begin with an HMM trained from positive
examples to model a given protein family.
2. Use this HMM to map each new protein sequence
X we want to classify into a fixed length vector,
its Fisher score
3. Compute the kernel function on the basis of
the Euclidean distance between the score vector
for X and the score vectors for known positive
and negative examples Xi of the protein family.
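4. The resulting discriminant function is
given by $L(X) = \sum_{i: X_i \text{ positive}} \lambda_i K(X, X_i) - \sum_{i: X_i \text{ negative}} \lambda_i K(X, X_i)$, where the weights $\lambda_i \ge 0$ are learned from the training examples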
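
Steps 3 and 4 can be sketched as a kernel-weighted vote of positive and negative training examples. The score vectors and weights below are made up; in the actual method the weights come from SVM training and the vectors from the HMM.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def discriminant(u_x, examples, kernel=gaussian_kernel):
    """Sum of weight * kernel over positives minus the same sum over negatives.

    examples: list of (score_vector, label, weight), label +1 or -1.
    """
    return sum(label * weight * kernel(u_x, u_i)
               for u_i, label, weight in examples)

def predict(u_x, examples):
    """Predicted class is the sign of the discriminant function."""
    return 1 if discriminant(u_x, examples) > 0 else -1

# Hypothetical training set: one family member and one non-member.
training = [([0.0, 0.0], +1, 1.0), ([4.0, 0.0], -1, 1.0)]

near_positive = predict([0.5, 0.0], training)
near_negative = predict([3.5, 0.0], training)
```

Only the kernel values matter to the classifier, so any similarity measure that forms a valid kernel could be swapped in.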

39
Method

4. Combination of scores
In many cases, we can construct more than one HMM
model for the family or superfamily of interest.
Combine the scores from the multiple models
rather than selecting just one.
$L_i(X)$: the score for the query sequence $X$ based
on the $i$-th model
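
Two simple ways to combine the per-model scores are the average and the maximum. The transcript does not show which combination the slide used, so both are offered here only as illustrative options, with made-up scores.

```python
def combine_scores(scores):
    """Combine per-model scores L_i(X) into average and maximum scores."""
    return {
        "average": sum(scores) / len(scores),
        "maximum": max(scores),
    }

# Hypothetical scores of one query sequence under three HMM-based models.
combined = combine_scores([-0.5, 1.5, 2.0])
```

The maximum rewards a sequence that fits any one model well, while the average rewards consistent agreement across models.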

40
Result
41
Result 2

GPCR level 1 subfamily recognition
GPCR level 2 subfamily recognition

43
Reference

A. Gaulton, T. K. Attwood, Bioinformatics approach for the classification of GPCR, Current Opinion in Pharmacology 2003, 3:114-120
Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755-763
Jaakkola et al., A Discriminative Framework for Detecting Remote Protein Homologies, Journal of Computational Biology
Visual Recognition Tutorial. Thad Starner, Alex Pentland, Visual Recognition of American Sign Language Using HMMs, in International Workshop on Automatic Face and Gesture Recognition, pages 189-194, 1995

44
Thank You !
