Markov chains - PowerPoint PPT Presentation

Provided by: werner6
1
Markov chains
  • Basic structure of a classical Markov chain
  • Example: DNA. Each letter A, C, G, T can be assigned
    as a state, with transition probabilities
    P(x_i = t | x_{i-1} = s)

The probability of each state x_i depends only on the
value of the preceding symbol x_{i-1}.
The sum of probabilities over all possible sequences
is 1: a Markov chain describes a proper
probability distribution over the whole space of
sequences.
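
A minimal sketch of how a sequence probability follows from the transition probabilities. The transition values and the uniform start distribution are made-up assumptions for illustration, not estimates from real DNA. The last lines check that the probabilities of all sequences of length 3 add up to 1.

```python
import math
from itertools import product

# Hypothetical transition probabilities P(x_i = t | x_{i-1} = s).
# Illustrative numbers only; each row sums to 1.
TRANSITIONS = {
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.20, "C": 0.30, "G": 0.10, "T": 0.40},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.20, "C": 0.20, "G": 0.30, "T": 0.30},
}
# Assumed uniform distribution for the first symbol.
INITIAL = {base: 0.25 for base in "ACGT"}


def sequence_log_probability(seq):
    """Return log P(seq) = log P(x_1) + sum over i of log P(x_i | x_{i-1})."""
    log_p = math.log(INITIAL[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        log_p += math.log(TRANSITIONS[prev][curr])
    return log_p


# Sum P(seq) over every sequence of length 3; this should be 1.
total = sum(
    math.exp(sequence_log_probability("".join(s)))
    for s in product("ACGT", repeat=3)
)
```

Working in log space avoids numerical underflow for long sequences.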
2
CpG islands: biological function and impact on
gene regulation
  • Genomic regions with a CpG dinucleotide content of
    at least 60% of the statistically expected value
  • Overall, genomes have a much lower CpG frequency
    (about 1%; CG suppression).
  • Methylation of CpG sites in the promoter of a
    gene may inhibit the expression of a gene.
  • CpG islands typically occur at or near the
    transcription start site of genes, particularly
    in housekeeping genes of vertebrates

3
Example: CpG island
  • short stretches of 100-1000 nucleotides with
    frequently occurring CpG dinucleotides
  • Given a short stretch of genomic sequence, how
    can one decide whether it comes from a CpG island?
  • How can one find CpG islands within a long
    stretch of a genomic sequence?
  • Learning sets
  • 48 CpG islands (+ set)
  • sequences outside CpG islands (- set)
  • Likelihood ratio test
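
The likelihood ratio test can be sketched as follows: estimate one Markov chain from the + set and one from the - set, then sum the log-ratios of their transition probabilities along the query sequence. The training sequences below are tiny invented examples, and the add-one pseudocount is an assumption; a positive score favours the + model.

```python
import math

BASES = "ACGT"


def estimate_transitions(sequences, pseudocount=1.0):
    """Estimate P(t | s) from dinucleotide counts in the training sequences."""
    counts = {s: {t: pseudocount for t in BASES} for s in BASES}
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    return {
        s: {t: counts[s][t] / sum(counts[s].values()) for t in BASES}
        for s in BASES
    }


def log_odds_score(seq, plus, minus):
    """Sum of log2(P_plus(t|s) / P_minus(t|s)) over adjacent pairs, in bits."""
    return sum(
        math.log2(plus[prev][curr] / minus[prev][curr])
        for prev, curr in zip(seq, seq[1:])
    )


# Invented toy training sets (not real CpG island data).
plus_training = ["CGCGCGGCGC", "GCGCGCCGCG"]
minus_training = ["ATTATAATTA", "TTAATCATGA"]

plus_model = estimate_transitions(plus_training)
minus_model = estimate_transitions(minus_training)

score_island = log_odds_score("CGCGCG", plus_model, minus_model)
score_background = log_odds_score("ATTATA", plus_model, minus_model)
```

With these toy sets, the CG-rich query scores above zero and the AT-rich query below zero.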

4
From Markov Chains to Hidden Markov Models (HMM)
  • How to find CpG islands in an unannotated sequence?
  • One approach: a moving window of 100 nucleotides
  • Another approach: include both Markov chains
    (model + and model -) in one model, i.e. switch
    between the two models
  • States: A+, C+, G+, T+ and A-, C-, G-, T-
  • A state no longer corresponds to a symbol:
    A+ and A- both generate A, so from the symbol
    A alone one cannot infer which state it came from
    (hidden). The transition probabilities are different
    in A+ and A-.
  • Graphically
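
One possible way to write down the combined eight-state model, as a sketch only. The two four-state chains and the switch probability `p_switch` are hypothetical values, and the rule "switch with a fixed probability, then follow the other model's transition row" is a simplifying assumption. Sampling a path shows why the state is hidden: only the letter is observed, not the + or - label.

```python
import random

BASES = "ACGT"


def build_switching_model(plus, minus, p_switch):
    """Combine two 4-state chains into one 8-state chain over A+..T+, A-..T-."""
    states = [b + "+" for b in BASES] + [b + "-" for b in BASES]
    trans = {}
    for state in states:
        base, sign = state[0], state[1]
        own, other = (plus, minus) if sign == "+" else (minus, plus)
        other_sign = "-" if sign == "+" else "+"
        row = {}
        for t in BASES:
            row[t + sign] = (1.0 - p_switch) * own[base][t]   # stay in model
            row[t + other_sign] = p_switch * other[base][t]   # switch model
        trans[state] = row
    return trans


def sample_path(trans, start, length, rng):
    """Sample a state path of the given length, starting in `start`."""
    path = [start]
    for _ in range(length - 1):
        row = trans[path[-1]]
        path.append(rng.choices(list(row), weights=list(row.values()))[0])
    return path


# Hypothetical chains: the + model favours C and G, the - model A and T.
plus = {s: {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15} for s in BASES}
minus = {s: {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30} for s in BASES}

model = build_switching_model(plus, minus, p_switch=0.05)
path = sample_path(model, "A-", 20, random.Random(0))
symbols = "".join(state[0] for state in path)  # the observed sequence
```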

5
Globin sequences
6
Hidden Markov Model for protein sequences
  • three types of states
  • match states m_j, insertion states i_j, gap states
    d_j

For each line (arrow) in the diagram there is a transition probability.
The sequence of states (π) and the sequence of symbols
(x_j) are decoupled.
The path of states is still a Markov chain.
The match states m_j emit sequence symbols, i.e. each match state
corresponds to a column in a multiple sequence alignment.
7
Parametrization of an HMM
  • In an alignment of a sequence to a model, each
    residue is associated with a match or an insert state.
  • An alignment of a sequence to a model is called a
    path
  • The alignment is not unique
  • Emission probabilities e_k(b): the probability that
    state k emits symbol b

In practice, estimates for the emission
probabilities are taken from multiple sequence
alignments.
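
A minimal sketch of estimating emission probabilities from one column of a multiple sequence alignment. The toy alignment is invented, and the add-one pseudocount is only one common choice, assumed here so that residues not observed in the column keep a non-zero probability.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_probabilities(column, pseudocount=1.0):
    """Estimate e(b) for one match state from an alignment column.

    Gap characters ('-') and unknown symbols are ignored.
    """
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {b: (counts[b] + pseudocount) / total for b in AMINO_ACIDS}


# Invented toy alignment: five sequences, seven columns.
alignment = ["VGA-HAG", "V---NVD", "VEA-DVA", "VKG----", "VYS-TYE"]
columns = ["".join(row[j] for row in alignment) for j in range(len(alignment[0]))]

# Column 0 contains V five times: e(V) = (5 + 1) / (5 + 20) = 0.24.
e_first = emission_probabilities(columns[0])
```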
8
Searching with profile HMM
  • Given a sequence x, one has to find the most
    probable path π (alignment) to the model
  • Viterbi algorithm
  • The probability of the optimal alignment
    gives a score for how well a sequence fits into the
    protein family.
  • Probability of an alignment of a sequence to a
    model
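
A sketch of the Viterbi recursion in log space for a generic HMM. This is not a full profile HMM with match, insert and delete states; the two-state toy model (H for GC-rich, L for AT-rich) and all its probabilities are assumptions chosen to keep the example short.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path and its log probability."""
    # v[k]: log probability of the best path that ends in state k.
    v = {k: log_start[k] + log_emit[k][obs[0]] for k in states}
    back = []
    for symbol in obs[1:]:
        prev_v, v, pointers = v, {}, {}
        for k in states:
            best = max(states, key=lambda j: prev_v[j] + log_trans[j][k])
            v[k] = prev_v[best] + log_trans[best][k] + log_emit[k][symbol]
            pointers[k] = best
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]


def logs(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: {s: math.log(p) for s, p in row.items()} for k, row in table.items()}


# Hypothetical two-state model.
states = ["H", "L"]
log_start = {"H": math.log(0.5), "L": math.log(0.5)}
log_trans = logs({"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}})
log_emit = logs({
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
})

best_path, best_log_p = viterbi("GCGCATAT", states, log_start, log_trans, log_emit)
```

For this toy input the best path labels the first four symbols H and the last four L.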

Practical aspects: log-likelihood (LL) scoring and
log-odds score, i.e. comparison to a random model.
Example: modelling and searching for globins
9
profile HMM
Ref - Durbin, Eddy, Krogh, Mitchison,
Biological sequence analysis, Cambridge
University Press.
10
Comparison to 1-d profiles
Multiple sequence alignment of Fig. 5.3
f(j,b): probability of amino acid b at position j
Profile p: the expected score for a given
sequence y_j to fit into the family
Parameters for HMM (ad hoc rule: pseudocounts for
residues not observed)
11
Log-likelihood score and log-odds score
Ref - Durbin, Eddy, Krogh, Mitchison,
Biological sequence analysis, Cambridge
University Press.
12
Z-score
13
Concepts of protein structure prediction
  • Why is there a need for protein structure
    prediction?
  • the sequence of a protein is easily available
  • the determination of 3D structures is still a
    slow process
  • energy based methods
  • free energy of the protein in the native state
    is minimal
  • Anfinsen experiment
  • ab initio structure prediction is still an
    unsolved problem
  • holy grail of computational biology
  • knowledge-based methods
  • parameters are extracted from currently known 3D
    structures
  • examples
  • secondary structure prediction
  • fold recognition methods (threading)
  • knowledge based force field terms are added to
    free energy term

14
  • (2) Prediction of secondary structure and
    long-range contacts
  • Secondary structure prediction: derive propensity
    values of residues from statistical analysis of
    residues in known secondary structure
  • More sophisticated methods: neural networks,
    combined prediction from MSA and HMM
  • Long-range contacts
  • Tree-determinant residues
  • Motifs
  • Correlated mutations
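
A sketch of how a propensity value might be derived from residues of known secondary structure. The definition used here (frequency of an amino acid within one structural state divided by its overall frequency) is one common convention and is assumed, and the toy sequence and its annotation are invented.

```python
from collections import Counter


def propensities(sequence, structure, state):
    """Propensity of each amino acid for `state`.

    propensity(aa) = (fraction of `state` residues that are aa)
                     / (fraction of all residues that are aa)
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    all_counts = Counter(sequence)
    state_counts = Counter(aa for aa, ss in zip(sequence, structure) if ss == state)
    n_all = len(sequence)
    n_state = sum(state_counts.values())
    if n_state == 0:
        raise ValueError("no residues annotated with state " + repr(state))
    return {
        aa: (state_counts[aa] / n_state) / (all_counts[aa] / n_all)
        for aa in all_counts
    }


# Invented example: H = helix, C = coil.
sequence = "AEAGPAEG"
structure = "HHHCCHHC"
helix_propensity = propensities(sequence, structure, "H")
```

Values above 1 mark residues that are over-represented in the state, values below 1 residues that are under-represented.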

15
Comparative Homology Modeling
  • 3D structures of proteins come in families and
    superfamilies
  • E.g. SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
  • Families: high sequence identities (> 35%), same
    functional residues
  • Superfamilies: similar 3D fold, some common
    functional motifs
  • No universal definition of superfamilies
  • Folds: similar 3D fold
  • Rule of thumb: if two proteins have an alignment
    with a sequence identity > 30%, they have the same
    fold.
  • More sophisticated methods for fold recognition:
    3D profiles or threading
  • Steps:
  • - for a target sequence, find a homologous PDB
    template structure,
  • - make an optimum alignment between the target
    and template sequences,
  • - generate the tertiary structure of the
    target using the template geometry.
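
The rule of thumb above needs a sequence identity for an existing alignment. A minimal sketch, assuming identity is counted over alignment columns where neither sequence has a gap (other definitions use a different denominator); the aligned strings are invented.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns without a gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Invented target/template alignment with one gap in each sequence.
identity = percent_identity("MKT-AYIAK", "MKSLAY-AR")

# Threshold from the rule of thumb (30% sequence identity).
same_fold_expected = identity > 30.0
```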

16
Additional considerations
What is the secondary structure?
Is it homologous to other protein sequences?
Is it homologous to other protein structures?
What is the best sequence alignment between your target protein
and homologous PDB structures?
Examine the regions of insertions and deletions. Are they
located in the loop regions? On the surface? Is
the region hydrophobic or hydrophilic?
The PDB template might have functional sites and
established motifs. Does your target sequence have
the same features?
If disulphide bridges are present in the PDB template, are
cysteine residues aligned?
17
Methods for prediction
  • Classic method (Chou-Fasman, 1974)
  • simplified rules
  • separate amino acids into groups of helix
    (β-strand) formers and breakers
  • search for clusters of formers (four helix formers out
    of six contiguous residues; three β-strand formers out of
    five residues), then extend the segments in both
    directions until a tetrapeptide of breakers is
    found
  • later improvements
  • Garnier-Osguthorpe-Robson (GOR) method
  • the influence of the residue at position j on the secondary
    structure in the neighborhood of the residue is
    included
  • the main effect is statistically found in the range
    j ± 8
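
A sketch of the nucleation step of the classic method: slide a window of six residues along the sequence and report windows with at least four helix formers. The set of formers below is an illustrative assumption, not the published propensity table, and the extension step until a tetrapeptide of breakers is left out.

```python
# Illustrative set of helix formers (assumption, not the published table).
HELIX_FORMERS = set("AELMKQ")


def helix_nucleation_sites(sequence, formers=HELIX_FORMERS, window=6, min_formers=4):
    """Return start indices of windows containing at least `min_formers` formers."""
    sites = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if sum(res in formers for res in segment) >= min_formers:
            sites.append(start)
    return sites


# Invented sequence: a former-rich stretch followed by G/P/S residues.
sites = helix_nucleation_sites("GPAELMKGPGSG")
```

The same function covers β-strand nucleation by passing a set of strand formers with `window=5` and `min_formers=3`.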

18
Improvements of the methods
Major improvements:
larger databases
multiple sequence alignments
neural network method
consensus prediction
meta server
19
Neural network
Topology of a neural network: each node
represents a number between 0 and 1.
Nodes of: input layer I_n, hidden layer, output layer
Switch function
The parameters w and b are determined by training
the net. If w_kn is very large, node k responds
very sensitively to node n.
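
A sketch of how one layer of such a network could be evaluated. The logistic sigmoid as switch function and the convention of adding b to the weighted sum are assumptions (the original formula is not in the transcript); the weights and inputs are made-up numbers.

```python
import math


def sigmoid(z):
    """Assumed switch function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weights, biases):
    """out_k = sigmoid(sum over n of w[k][n] * inputs[n] + b[k])."""
    return [
        sigmoid(sum(w_kn * x_n for w_kn, x_n in zip(row, inputs)) + b_k)
        for row, b_k in zip(weights, biases)
    ]


# Made-up parameters: 3 input nodes, 2 hidden nodes, 1 output node.
inputs = [1.0, 0.0, 0.5]
hidden = layer_output(inputs, [[0.2, -0.4, 0.1], [5.0, 0.3, -0.2]], [0.0, -1.0])
output = layer_output(hidden, [[1.5, -2.0]], [0.1])
```

The second hidden node has a large weight (5.0) on the first input, so small changes of that input move its value strongly.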
20
Training of the network
The training of the network requires (many) pairs
of (I_p, O_p) given to the network, and adjusting w
and b to obtain an optimal fit. The number of
training patterns should be at least 3 to 5 times
larger than the number of adjustable parameters w
and b, to avoid overfitting. The network learns
by minimizing an error function T that compares the
target outputs O_p with the outputs O_p^calc,
where O_p^calc is calculated by the NN from I_p
  • T can be minimized by an iterative process
  • choose w and b randomly
  • change w and b according to the steepest descent
    method
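
The iterative procedure can be sketched for the simplest case of a single sigmoid node. The sum of squared differences between target and calculated output is assumed as the error T (the formula is missing from the transcript), and the training patterns (logical OR), learning rate and number of iterations are arbitrary choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, w, b):
    return sigmoid(sum(w_n * x_n for w_n, x_n in zip(w, inputs)) + b)


def total_error(patterns, w, b):
    """Assumed error: T = sum over patterns of (O_p - O_p_calc)^2."""
    return sum((target - node_output(inputs, w, b)) ** 2 for inputs, target in patterns)


def train(patterns, w, b, rate=0.3, epochs=5000):
    """Steepest descent: move w and b against the gradient of T."""
    w = list(w)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for inputs, target in patterns:
            out = node_output(inputs, w, b)
            # Derivative of (target - out)^2 with respect to the weighted sum.
            delta = -2.0 * (target - out) * out * (1.0 - out)
            for n, x_n in enumerate(inputs):
                grad_w[n] += delta * x_n
            grad_b += delta
        w = [w_n - rate * g for w_n, g in zip(w, grad_w)]
        b -= rate * grad_b
    return w, b


# Toy training pairs (I_p, O_p): logical OR of two inputs.
patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

rng = random.Random(0)
w0 = [rng.uniform(-0.5, 0.5) for _ in range(2)]  # choose w and b randomly
b0 = rng.uniform(-0.5, 0.5)

error_before = total_error(patterns, w0, b0)
w, b = train(patterns, w0, b0)
error_after = total_error(patterns, w, b)
```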

21
Secondary structure prediction by NN
(A) Basic scheme
Output layer: represents the secondary structure,
i.e. α, β, c (helix, strand, coil)
Input layer: each amino acid is treated as a separate node
(20 nodes), with additional nodes for gaps and insertions
(B) Topology of the network
First level, sequence to structure: the window X_{i-6} to
X_{i+6} determines the secondary structure at position i
Second level, structure to structure: a window of predicted
secondary structures with mixed predictions,
e.g. ααβααβcααββαααc, determines a helical or β-strand region
Third level: jury decision over independently trained networks
Ref: Rost, B. PHD: predicting one-dimensional
protein structure by profile-based neural
networks. Methods in Enzymology 266:525-539, 1996
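
A sketch of the input encoding for the first level: the window from i-6 to i+6 is turned into one input vector with 20 amino acid nodes plus one extra node per position. Using a single extra node for gaps and for positions beyond the sequence ends is an assumption, as is the plain one-hot encoding; profile-based methods feed alignment frequencies instead of single residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: one extra symbol covers gaps, unknown residues and
# window positions that fall outside the sequence.
SYMBOLS = AMINO_ACIDS + "-"


def encode_window(sequence, i, half_width=6):
    """One-hot encode the residues from i - half_width to i + half_width."""
    vector = []
    for pos in range(i - half_width, i + half_width + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else "-"
        index = SYMBOLS.find(symbol)
        if index < 0:            # unknown residue: use the extra node
            index = len(SYMBOLS) - 1
        one_hot = [0] * len(SYMBOLS)
        one_hot[index] = 1
        vector.extend(one_hot)
    return vector


# Window around the first residue of an invented sequence:
# six positions lie before the start and map to the extra node.
window_vector = encode_window("MKTAYIAKQR", 0)
```

Each window yields 13 x 21 = 273 input values, of which exactly 13 are set to 1.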
22
Secondary Structure Prediction Servers
  • APSSP2: www.imtech.res.in/raghava/apssp2/
  • Advanced Protein Secondary Structure Prediction
    Server, GPS Raghava, Bioinformatics Center,
    Chandigarh
  • PSIPRED: bioinf.cs.ucl.ac.uk/psipred/index.html
  • The PSIPRED Protein Structure Prediction Server,
    D. T. Jones, Department of Computer Science,
    University College London, UK.
  • PROF: www.aber.ac.uk/phiwww/prof/
  • University of Wales, Aberystwyth, Computational
    Biology Group.
  • PredictProtein: cubic.bioc.columbia.edu/predictprotein/
  • The PredictProtein server, B. Rost, Columbia
    University, NY.
  • SAM-T02sec: www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
  • HMM methods, K. Karplus, UCSC
  • JPRED: www.compbio.dundee.ac.uk/www-jpred/
  • A consensus method for protein secondary
    structure prediction
  • G. Barton, University of Dundee

23
Performance of secondary structure prediction
methods in CAFASP
Ref: Eyrich et al. Proteins 53:548-560 (2003).
24
Performance of secondary structure prediction methods on a
larger set