Susie Jo - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Protein Family Classification Using AI Techniques
(Profile HMMs, SVM)
2003.12.04, Susie Jo, Bio-Information System
Laboratory, BioSystem Dept., KAIST
2
Table of Contents
Intro
Profile HMM
SVM-Fisher
3
Protein homology detection
CS774
  • Proteins and DNA can be encoded as primary sequences
    (amino acid residues: 20 types;
    nucleotides: A/G/C/T)
  • Functionally annotated sequence
  • Functionally unknown sequence

Introduction
Profile HMM
SVM-Fisher
SOM
[Figure labels: sequence similarity; similar sequence, similar function]
4
Protein Classification: SCOP (Structural
Classification of Proteins database)
  • SCOP hierarchy of protein domains

[Figure: SCOP hierarchy. Labels: primary level; varying degrees of similarity]
5
GPCR
  • The three major subfamilies include:
  • Receptors related to the light receptor rhodopsin
    and the β2-adrenergic receptor (family A)
  • - can be subdivided into six major subgroups
  • - overall homology among all family A receptors is
    low
  • - a few key residues are highly conserved
  • - e.g. the Asp-Arg-Tyr (DRY) motif
  • Receptors related to the glucagon receptor
    (family B)
  • Receptors related to the metabotropic
    neurotransmitter receptors (family C)
  • Yeast pheromone receptors (families D and E)
  • cAMP receptors (family F)
6
GPCR 3 major subfamilies
Low Sequence Homology
7
Protein remote homology detection Methods
  • 1. Pairwise similarities between proteins
  • - Simple sequence similarity
  • - Smith-Waterman dynamic programming
  • - BLAST, FASTA
  • - (+) Simple, easy
  • - (-) Low accuracy

The urotensin receptor is very similar in sequence to
somatostatin receptor 4; however, they actually have
different ligands.
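
The Smith-Waterman dynamic programming mentioned above can be sketched as follows. This is a minimal, score-only illustration: the match, mismatch, and gap values are assumptions chosen for readability (real tools typically score with a substitution matrix), and the two sequences are borrowed from the family-profile example later in the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


score = smith_waterman("RWDAGCVN", "RWDSGCVN")
print(score)
```

A high pairwise score only says the two sequences look alike; as the urotensin example shows, it does not guarantee the same function.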
8
Protein remote homology detection Methods
  • 2. Profiles and hidden Markov models (HMMs)
  • - Profile-based methods iteratively collect
    homologous sequences from a large database and
    incorporate the resulting statistics into a
    single model
  • - PSI-BLAST and SAM-T98
  • 3. SVM-Fisher method (Jaakkola et al.,
    1999, 2000)
  • - couples an iterative HMM training scheme with
    the SVM
9
(No Transcript)
10
Motivation
  • Objective: given a family of related sequences,
    find an effective way to capture what they
    have in common, so that we can recognize other
    members of the family.
  • Some standard methods for characterization
  • - Multiple alignments
  • - Profiles
  • - Regular expressions
  • - Consensus sequences
  • - Hidden Markov models
11
A. Gaulton and T. K. Attwood, Bioinformatics approach
for the classification of GPCR, Current Opinion in
Pharmacology, 2003, 3:114-120
Using Family Profile
  • Use an MSA (multiple sequence alignment) of the
    family to identify the most highly conserved regions

Example alignment:
RWDAGCVN
RWDSGCVN
RWHHGCVQ
RWKGACYN
RWLWACEQ
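
As a rough illustration of building a family profile, the sketch below computes per-column residue frequencies for the five aligned sequences on this slide and reports the fully conserved columns. The function names and the conservation threshold are assumptions made for this example.

```python
from collections import Counter

alignment = ["RWDAGCVN", "RWDSGCVN", "RWHHGCVQ", "RWKGACYN", "RWLWACEQ"]


def column_profile(seqs):
    """For each alignment column, map residue -> relative frequency."""
    n = len(seqs)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*seqs)]


def conserved_columns(seqs, threshold=1.0):
    """Indices (0-based) of columns whose most common residue reaches threshold."""
    return [i for i, col in enumerate(column_profile(seqs))
            if max(col.values()) >= threshold]


profile = column_profile(alignment)
print(conserved_columns(alignment))  # columns conserved in every sequence
```

Lowering `threshold` (for example to 0.6) also picks up columns that are only mostly conserved.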
12
Methods of characterizing a family of nucleotide
sequences
  • 1. Regular expression
  • - [AT][CG][AC][ACTG]*A[TG][GC]
  • - But it cannot distinguish between the
    highly implausible T G C T - - A G G
    and the consensus A C A C - - A T C
  • 2. Consensus sequence
  • - A C A C - - A T C
  • - Unclear what "consensus" means
  • - Need some kind of similarity table between
    nucleotides to measure the probability of a
    sequence
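
The weakness of the regular expression can be checked directly. The sketch below assumes the pattern on the slide is `[AT][CG][AC][ACTG]*A[TG][GC]` (the bracket characters were lost in the transcript) and shows that both the implausible sequence and the consensus match it equally well; the helper name is hypothetical.

```python
import re

# Assumed reconstruction of the slide's pattern; brackets restored by hand
FAMILY_PATTERN = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")


def matches_family(seq):
    """True if the gap-free sequence matches the family regular expression."""
    ungapped = seq.replace("-", "").replace(" ", "")
    return FAMILY_PATTERN.fullmatch(ungapped) is not None


implausible = "T G C T - - A G G"
consensus = "A C A C - - A T C"
print(matches_family(implausible), matches_family(consensus))
```

A regular expression gives only a yes/no answer, which is why a probabilistic model is needed to rank the consensus above the implausible sequence.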
13
Sean R. Eddy, Profile hidden Markov models,
Bioinformatics, vol. 14, no. 9, 1998, 755-763
HMM
  • A model that generates sequences
  • A symbol sequence (the observations) is generated
    by moving between states.
  • The state sequence is hidden.
  • - States
  • - Symbol emission probabilities
  • - State transition probabilities

Hidden state sequence: S
Observed symbol sequence: X
Joint probability: $P(X, S \mid \mathrm{HMM})$
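
To make the joint probability of a symbol sequence X and a state path S concrete, here is a small sketch. The two-state, two-symbol model and its numbers are invented purely for illustration; `pi`, `A`, and `B` stand for the initial, transition, and emission probabilities.

```python
def joint_probability(pi, A, B, states, symbols):
    """P(X, S | HMM): probability of emitting `symbols` along the path `states`."""
    p = pi[states[0]] * B[states[0]][symbols[0]]
    for t in range(1, len(symbols)):
        # One transition followed by one emission per time step
        p *= A[states[t - 1]][states[t]] * B[states[t]][symbols[t]]
    return p


# Hypothetical toy model (not from the slides)
pi = [0.6, 0.4]                      # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(next state j | state i)
B = [[0.5, 0.5], [0.1, 0.9]]         # B[i][k] = P(symbol k | state i)

X = [0, 1, 0]
S = [0, 0, 0]
print(joint_probability(pi, A, B, S, X))
```

Because S is hidden in practice, the probability of X alone requires summing this quantity over every possible state path.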
14
3. HMM (e.g. gene sequence)
[Figure: HMM state path M M M I M M M, where I is the insertion state]
15
Deriving HMM and Scoring HMM
  • Deriving the HMM from a known alignment
  • - Each column in the alignment generates a state
  • - Count the occurrences of A, T, G, C in each column
    to determine the emission probabilities for each state
  • - Estimate transition probabilities to insertion
    states in a similar way (some caution is needed)
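
The counting step can be sketched as below. The alignment is a made-up toy example, and the rule for deciding which columns become match states (at most half gaps) is an assumption for this sketch; transition probabilities are left out.

```python
from collections import Counter

ALPHABET = "ACGT"


def derive_match_emissions(alignment, max_gap_fraction=0.5):
    """Return (match column indices, emission probabilities per match state)."""
    n = len(alignment)
    match_columns, emissions = [], []
    for idx, column in enumerate(zip(*alignment)):
        residues = [c for c in column if c != "-"]
        if (n - len(residues)) / n > max_gap_fraction:
            continue  # mostly gaps: treated as an insert region, not a match state
        counts = Counter(residues)
        emissions.append({a: counts[a] / len(residues) for a in ALPHABET})
        match_columns.append(idx)
    return match_columns, emissions


# Hypothetical toy alignment (not from the slides)
toy_alignment = ["ACA-ATG", "TCAAATC", "ACA-AGC", "AGA-ATC", "ACC-ATC"]
match_columns, emissions = derive_match_emissions(toy_alignment)
print(match_columns)
print(emissions[0])
```

Raw counts like these assign zero probability to unseen residues, which is what pseudocounts are meant to correct.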
16
Probability and Log-odds
  • Probability: depends on the sequence length (L)
  • - Penalizes insertions, favors deletions
  • Log-odds: computed using a null model
  • - The null model considers the overall sequence of
    nucleotides as random
  • - Better estimate: use the overall frequency of
    nucleotides (or amino acids) in the organism's genome
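
A log-odds score compares the model probability with the null-model probability of the same sequence, which removes most of the length dependence. The sketch below is a simplification that scores only match-state emissions (transitions are ignored), and the emission table and uniform background are invented for illustration.

```python
import math


def log_odds(seq, emissions, background):
    """Sum over positions of log(model emission / null-model frequency)."""
    return sum(math.log(emissions[i][x] / background[x])
               for i, x in enumerate(seq))


# Hypothetical emission probabilities for three match states
emissions = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
# Null model: every nucleotide equally likely (an assumption)
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(log_odds("ACA", emissions, background))  # positive: fits the model
print(log_odds("TGT", emissions, background))  # negative: fits the null model better
```

Replacing the uniform `background` with genome-wide nucleotide frequencies gives the better estimate mentioned on the slide.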
17
Sean R. Eddy, Profile hidden Markov models,
Bioinformatics, vol. 14, no. 9, 1998, 755-763
Profile HMM
  • An HMM architecture for representing profiles of
    multiple sequence alignments
  • Linear left-to-right model
  • Match state
  • Insert state
  • Delete state

Example alignment (columns 1 2 3):
C A F
C G W
C D Y
C V F
C K Y
18
Visual Recognition Tutorial. Thad Starner, Alex
Pentland. Visual Recognition of American Sign
Language Using HMMs. In International Workshop on
Automatic Face and Gesture Recognition, pages
189-194, 1995
Elements of Profile HMMs
  • N: the number of hidden states
  • Q: the set of states, $Q = \{1, 2, \dots, N\}$
  • M: the number of symbols
  • V: the set of symbols, $V = \{1, 2, \dots, M\}$
  • A: the state-transition probability matrix,
    $a_{ij} = P(q_{t+1} = j \mid q_t = i)$
  • B: the observation probability distribution,
    $b_j(k) = P(o_t = k \mid q_t = j)$
  • π: the initial state distribution, $\pi_i = P(q_1 = i)$
  • λ: the entire model, $\lambda = (A, B, \pi)$
19
Three Basic Problems
  • EVALUATION: given an observation
    $O = (o_1, o_2, \dots, o_T)$ and a model $\lambda$,
    efficiently compute $P(O \mid \lambda)$
  • - Hidden states complicate the evaluation
  • - Given two models $\lambda_1$ and $\lambda_2$, this
    can be used to choose the better one.
  • DECODING: given an observation
    $O = (o_1, o_2, \dots, o_T)$ and a model $\lambda$,
    find the optimal state sequence
    $q = (q_1, q_2, \dots, q_T)$.
  • - An optimality criterion has to be decided (e.g.
    maximum likelihood)
  • - "Explanation" of the data.
  • LEARNING: given $O = (o_1, o_2, \dots, o_T)$, estimate
    the model parameters $\lambda = (A, B, \pi)$ that
    maximize $P(O \mid \lambda)$
20
Solution to problem 1
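1. Forward Algorithm
  • Define the forward variable as
    $\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = i \mid \lambda)$
  • $\alpha_t(i)$ is the probability of observing the
    partial sequence $(o_1, \dots, o_t)$ such that the
    state $q_t$ is $i$.
  • Induction
  • 1. Initialization: $\alpha_1(i) = \pi_i b_i(o_1)$
  • 2. Induction: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$
  • 3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$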
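
The three steps of the forward algorithm translate almost line by line into code. The sketch below uses a hypothetical two-state, two-symbol model and works with plain probabilities (no scaling or logarithms), which is only adequate for short sequences.

```python
def forward(pi, A, B, O):
    """Return P(O | lambda) and the table alpha[t][i] for lambda = (A, B, pi)."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[-1]), alpha


# Hypothetical toy model (not from the slides)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
O = [0, 1, 0]

likelihood, alpha = forward(pi, A, B, O)
print(likelihood)  # approximately 0.0696
```

The cost is proportional to N squared times T, compared with N to the power T for summing over every state path explicitly.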
21
Solution to problem 1
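2. Backward Algorithm
  • Define the backward variable as
    $\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = i, \lambda)$
  • $\beta_t(i)$ is the probability of observing the
    partial sequence $(o_{t+1}, \dots, o_T)$ given that
    the state $q_t$ is $i$.
  • Induction
  • 1. Initialization: $\beta_T(i) = 1$
  • 2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)$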
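
The backward recursion can be sketched in the same style. The toy model is hypothetical, and the final line that recovers the likelihood from the backward table, by combining it with the initial and emission probabilities at the first position, is added here as a consistency check rather than taken from the slide.

```python
def backward(pi, A, B, O):
    """Return P(O | lambda) and the table beta[t][i] for lambda = (A, B, pi)."""
    N = len(pi)
    # Initialization: beta_T(i) = 1
    beta = [[1.0] * N]
    # Induction, from t = T-1 down to 1:
    # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for o_next in reversed(O[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o_next] * nxt[j] for j in range(N))
                        for i in range(N)])
    # P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    likelihood = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(N))
    return likelihood, beta


# Hypothetical toy model (not from the slides)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
O = [0, 1, 0]

likelihood, beta = backward(pi, A, B, O)
print(likelihood)  # approximately 0.0696
```

On the same model and observations, the forward and backward procedures should return the same likelihood.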
22
Solution to problem 2
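  • Choose the most likely path
  • Find the path $(q_1, q_2, \dots, q_T)$ that maximizes
    the likelihood $P(q_1, q_2, \dots, q_T \mid O, \lambda)$
  • Solution by dynamic programming
  • Define $\delta_t(i) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = i, o_1, \dots, o_t \mid \lambda)$
  • $\delta_t(i)$ is the probability of the
    highest-probability path ending in state $i$
  • By induction we have
    $\delta_{t+1}(j) = \left[ \max_i \delta_t(i) a_{ij} \right] b_j(o_{t+1})$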
23
Solution to problem 2
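Viterbi Algorithm
  • Initialization: $\delta_1(i) = \pi_i b_i(o_1)$,
    $\psi_1(i) = 0$
  • Recursion: $\delta_t(j) = \max_i \left[ \delta_{t-1}(i) a_{ij} \right] b_j(o_t)$,
    $\psi_t(j) = \arg\max_i \left[ \delta_{t-1}(i) a_{ij} \right]$
  • Termination: $P^* = \max_i \delta_T(i)$,
    $q_T^* = \arg\max_i \delta_T(i)$
  • Path (state sequence) backtracking:
    $q_t^* = \psi_{t+1}(q_{t+1}^*)$ for $t = T-1, \dots, 1$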
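
The four Viterbi steps can be sketched as follows. The toy model is hypothetical and the code multiplies raw probabilities; for long sequences a practical implementation would usually work with log probabilities instead.

```python
def viterbi(pi, A, B, O):
    """Return the most likely state path for O and its joint probability."""
    N = len(pi)
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]
    psi = []  # psi[t][j] = best predecessor of state j at step t+1
    # Recursion
    for o in O[1:]:
        step_delta, step_psi = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            step_psi.append(best_i)
            step_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        psi.append(step_psi)
        delta = step_delta
    # Termination: pick the best final state
    last = max(range(N), key=lambda i: delta[i])
    # Backtracking through the stored predecessors
    path = [last]
    for step_psi in reversed(psi):
        path.append(step_psi[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical toy model (not from the slides)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
O = [0, 1, 0]

best_path, best_prob = viterbi(pi, A, B, O)
print(best_path, best_prob)
```

Compared with the forward algorithm, the only structural change is replacing the sum over predecessor states with a maximum and remembering which predecessor achieved it.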
24
Solution to problem 3
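Baum-Welch Algorithm
  • Estimate $\lambda = (A, B, \pi)$ to maximize
    $P(O \mid \lambda)$
  • No analytic method because of complexity:
    iterative solution.
  • Baum-Welch algorithm (actually an EM algorithm)
  • 1. Let the initial model be $\lambda_0$
  • 2. Compute a new $\lambda$ based on $\lambda_0$ and the
    observation O.
  • 3. If $\log P(O \mid \lambda) - \log P(O \mid \lambda_0) < \Delta$
    (a chosen threshold), stop
  • 4. Else set $\lambda_0 \leftarrow \lambda$ and go to step 2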
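
The iteration above can be sketched for a single observation sequence. This is an unscaled, minimal illustration: the toy model, the observation sequence, and the default threshold are assumptions, and real profile-HMM training additionally uses several sequences, pseudocounts, and numerical scaling.

```python
import math


def forward_table(pi, A, B, O):
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward_table(A, B, O):
    N = len(A)
    beta = [[1.0] * N]
    for o in reversed(O[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta


def baum_welch_step(pi, A, B, O):
    """One re-estimation step: returns the new (pi, A, B)."""
    N, M, T = len(pi), len(B[0]), len(O)
    alpha, beta = forward_table(pi, A, B, O), backward_table(A, B, O)
    p_obs = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)] for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if O[t] == k) /
              sum(gamma[t][i] for t in range(T)) for k in range(M)]
             for i in range(N)]
    return new_pi, new_A, new_B


def log_likelihood(pi, A, B, O):
    return math.log(sum(forward_table(pi, A, B, O)[-1]))


def baum_welch(pi, A, B, O, delta=1e-6, max_iter=100):
    """Iterate until the log-likelihood gain drops below delta."""
    log_p0 = log_likelihood(pi, A, B, O)
    for _ in range(max_iter):
        pi, A, B = baum_welch_step(pi, A, B, O)
        log_p = log_likelihood(pi, A, B, O)
        if log_p - log_p0 < delta:
            break
        log_p0 = log_p
    return pi, A, B


# Hypothetical starting model and observations (not from the slides)
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 0, 0, 1, 1, 0, 1]

pi1, A1, B1 = baum_welch(pi0, A0, B0, obs, max_iter=10)
print(log_likelihood(pi0, A0, B0, obs), log_likelihood(pi1, A1, B1, obs))
```

Each step cannot decrease the likelihood, but the procedure only finds a local maximum, so the result depends on the starting model.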
25
Preventing Overfitting
  • Pseudocount (fake count)
  • - It is dangerous to estimate a probability
    distribution from just a few examples
  • - Pretend you saw an amino acid in a position even
    though it wasn't there
  • Sequence weighting
  • - Some sequences are more frequent than others
  • Get more data!
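
A simple way to apply pseudocounts is to add the same small count to every symbol before normalizing. The sketch below uses a constant pseudocount of 1 and a nucleotide alphabet; both are assumptions for illustration, and the slides do not say which pseudocount scheme was used.

```python
def emission_with_pseudocounts(counts, alphabet, pseudocount=1.0):
    """Turn observed counts into probabilities, pretending each symbol
    in `alphabet` was seen `pseudocount` extra times."""
    total = sum(counts.get(a, 0) for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in alphabet}


# Hypothetical column with five observations: A seen 4 times, T once
column_counts = {"A": 4, "T": 1}
probs = emission_with_pseudocounts(column_counts, "ACGT")
print(probs)
```

Without the pseudocount, C and G would get probability zero, and any sequence with either residue in this column would be scored as impossible.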
26
Pseudocount
27
SAM-T98 (software tool)
28
SVM-Fisher
29

Jaakkola et al., A Discriminative Framework for
Detecting Remote Protein Homologies, Journal of
Computational Biology
Discriminative Framework for Detecting Remote
Protein Homologies
  • A variant of support vector machines using a new
    kernel function
  • Kernel function
  • - derived from a generative statistical model for
    a protein family, in this case an HMM
  • Generative statistical models built from
    multiple sequences, in this case HMMs, are used
    to extract features from protein sequences.
    This maps all protein sequences to points in a
    Euclidean feature space of fixed dimension.
30
Method
  • $X = (x_1, \dots, x_n)$: protein sequence, where
    $x_i$ is an amino acid residue
  • $H_1$: estimated HMM for a particular protein family
  • $P(X \mid H_1)$: corresponding probability model
  • A likelihood ratio score is used in place of the
    simple probability $P(X \mid H_1)$
31
Method
  • 1. Discriminative approaches
  • - By Bayes' rule, we obtain $P(H_1 \mid X)$, the
    posterior probability of the model
  • - This is the posterior probability that the
    sequence X belongs to the protein family being modeled.
  • - Score function $L(X)$: the log posterior odds score
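
The equations on this slide did not survive transcription. One standard way to write the posterior and the log posterior odds is shown below, assuming a null model $H_0$ for sequences outside the family; $H_0$ and the priors $P(H_1)$, $P(H_0)$ are symbols introduced here, not defined in the transcript.

```latex
\begin{align}
P(H_1 \mid X) &= \frac{P(X \mid H_1)\, P(H_1)}
                      {P(X \mid H_1)\, P(H_1) + P(X \mid H_0)\, P(H_0)} \\
L(X) &= \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}
      = \log \frac{P(X \mid H_1)}{P(X \mid H_0)} + \log \frac{P(H_1)}{P(H_0)}
\end{align}
```

Under this form, the score is the likelihood ratio plus a constant prior term, so a positive $L(X)$ favors family membership.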
32
Method
  • 2. Kernel methods
  • - $K(X_i, X)$: kernel function
  • - a measure of pairwise similarity between the
    training example $X_i$ and the new example $X$
  • - The sign of the discriminant function $L(X)$
    determines the predicted class for any sequence $X$
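
A kernel discriminant of this kind can be sketched as a weighted sum of kernel values over the training examples. In the example below the weights `lambdas` are fixed by hand rather than learned by an SVM, the examples are plain 2-D vectors, and the kernel is a simple dot product; all of these are placeholder assumptions.

```python
def discriminant(x, train, labels, lambdas, kernel):
    """L(x) = sum_i y_i * lambda_i * K(x_i, x), with y_i = +1 or -1."""
    return sum(y * lam * kernel(xi, x)
               for xi, y, lam in zip(train, labels, lambdas))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict(x, train, labels, lambdas, kernel):
    """Predicted class is the sign of the discriminant."""
    return 1 if discriminant(x, train, labels, lambdas, kernel) > 0 else -1


# Hypothetical training set: one positive and one negative example
train = [[1.0, 1.0], [-1.0, -1.0]]
labels = [+1, -1]
lambdas = [0.5, 0.5]  # illustrative weights, not learned

print(discriminant([2.0, 1.0], train, labels, lambdas, dot))
print(predict([-1.0, -2.0], train, labels, lambdas, dot))
```

In the SVM-Fisher setting the vectors would be Fisher scores of protein sequences and the weights would come from SVM training.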
33
Method
  • 3. The Fisher kernel
  • - Fisher score $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$:
    the gradient w.r.t. the parameters of the HMM
  • - $P(X \mid H_1, \theta)$: the corresponding probability
    model, an HMM estimated for a particular
    family of proteins
  • - $\theta$: includes the output and the transition
    probabilities of the trained HMM
34
Method
  • Fisher score
  • - the gradient w.r.t. the parameters of the HMM
  • - Probability value of the HMM for each sequence
35
Method
  • Fisher score
  • - Derivatives of $\log P(X \mid H_1, \theta)$ w.r.t. the
    emission probabilities

36
Method
  • Fisher score vector $U_X$ relative to the
    emission probabilities
  • - a vector whose components are indexed by $(x, s)$,
    with values given by the derivative of
    $\log P(X \mid H_1, \theta)$ with respect to the emission
    probability of residue $x$ in state $s$
  • - Dimension of the Fisher score vector: 20 × m
    (m: number of states)
  • - Expected posterior frequency of visiting state
    $s$ and generating residue $x$
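
The slide's formula is missing from the transcript. The sketch below assumes the commonly used form $U_X(x, s) = \xi(x, s) / \theta_{x \mid s} - \xi(s)$, where $\xi(x, s)$ is the expected posterior count of visiting state $s$ and emitting $x$, and $\xi(s)$ is the expected count of visiting $s$. It uses a plain two-state, two-symbol toy HMM instead of a 20-letter profile HMM, so every number here is illustrative.

```python
def state_posteriors(pi, A, B, O):
    """gamma[t][s] = P(q_t = s | O), computed with forward-backward."""
    N, T = len(pi), len(O)
    alpha = [[pi[s] * B[s][O[0]] for s in range(N)]]
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][s] for i in range(N)) * B[s][o]
                      for s in range(N)])
    beta = [[1.0] * N]
    for o in reversed(O[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[s][j] * B[j][o] * nxt[j] for j in range(N))
                        for s in range(N)])
    p_obs = sum(alpha[-1])
    return [[alpha[t][s] * beta[t][s] / p_obs for s in range(N)] for t in range(T)]


def fisher_score(pi, A, B, O):
    """Fisher score w.r.t. emission probabilities, flattened by (state, symbol)."""
    N, M = len(B), len(B[0])
    gamma = state_posteriors(pi, A, B, O)
    score = []
    for s in range(N):
        xi_s = sum(g[s] for g in gamma)  # expected visits to state s
        for x in range(M):
            # expected visits to state s that emit symbol x
            xi_xs = sum(g[s] for g, o in zip(gamma, O) if o == x)
            score.append(xi_xs / B[s][x] - xi_s)
    return score


# Hypothetical toy model (not from the slides)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

U = fisher_score(pi, A, B, [0, 1, 0])
print(len(U), U)
```

Every sequence, whatever its length, maps to a vector with one entry per (state, symbol) pair, which is what makes a fixed-dimension kernel possible.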
37
Method
  • A natural (squared) distance between the gradient
    vectors
  • - Quantifies the similarity between two fixed-length
    gradient vectors $U_X$ and $U_{X'}$ corresponding to
    two sequences $X$ and $X'$
  • Gaussian kernel
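
A Gaussian kernel on two score vectors can be sketched as below. The width `sigma` is a free parameter whose value is assumed here, and the two vectors are made-up stand-ins for Fisher scores.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def gaussian_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    return math.exp(-squared_distance(u, v) / (2.0 * sigma ** 2))


# Hypothetical score vectors for two sequences
u_x = [0.0, 0.0]
u_x_prime = [3.0, 4.0]
print(gaussian_kernel(u_x, u_x_prime, sigma=5.0))
```

The kernel equals 1 for identical vectors and decays toward 0 as the distance grows, with `sigma` controlling how fast.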
38
Method
  • Method summary (SVM-Fisher method)
  • 1. Begin with an HMM trained from positive
    examples to model a given protein family.
  • 2. Use this HMM to map each new protein sequence
    X we want to classify into a fixed-length vector,
    its Fisher score.
  • 3. Compute the kernel function on the basis of
    the Euclidean distance between the score vector
    for X and the score vectors for known positive
    and negative examples $X_i$ of the protein family.
  • 4. The resulting discriminant function is
    given by
    $L(X) = \sum_{i \in \text{positive}} \lambda_i K(X, X_i) - \sum_{i \in \text{negative}} \lambda_i K(X, X_i)$
39
Method
  • 4. Combination of scores
  • - In many cases, we can construct more than one HMM
    for the family or superfamily of interest.
  • - Combine the scores from the multiple models
    rather than selecting just one.
  • - $L_i(X)$: the score for the query sequence X based
    on the i-th model
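
The combination rule itself is not preserved in the transcript. Two simple options, shown here as assumptions, are to average the per-model scores or to take their maximum; the function name and the example scores are hypothetical.

```python
def combine_scores(scores, how="average"):
    """Combine per-model scores L_i(X) into a single score for X."""
    if not scores:
        raise ValueError("at least one score is required")
    if how == "average":
        return sum(scores) / len(scores)
    if how == "max":
        return max(scores)
    raise ValueError("how must be 'average' or 'max'")


# Hypothetical scores of one query sequence under three models
model_scores = [1.5, -0.5, 2.0]
print(combine_scores(model_scores, "average"))
print(combine_scores(model_scores, "max"))
```

Averaging rewards agreement between models, while the maximum lets a single well-matching model decide.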
40
Result
41
Result 2
  • GPCR level 1 subfamily recognition
  • GPCR level 2 subfamily recognition

42
(No Transcript)
43
Reference
  • A. Gaulton and T. K. Attwood, Bioinformatics approach
    for the classification of GPCR, Current Opinion in
    Pharmacology, 2003, 3:114-120
  • Sean R. Eddy, Profile hidden Markov models,
    Bioinformatics, vol. 14, no. 9, 1998, 755-763
  • Jaakkola et al., A Discriminative Framework for
    Detecting Remote Protein Homologies, Journal of
    Computational Biology
  • Visual Recognition Tutorial. Thad Starner, Alex
    Pentland. Visual Recognition of American Sign
    Language Using HMMs. In International Workshop on
    Automatic Face and Gesture Recognition, pages
    189-194, 1995

44
Thank You !