CS882: Protein Structure Prediction - PowerPoint PPT Presentation

Provided by: bioinforma3
1
CS882 Protein Structure Prediction
  • Jinbo Xu
  • School of Computer Science, University of Waterloo

2
Outline
  • Protein Structure Basics
  • Protein Structure Prediction
  • Prediction Assessment
  • Linear Program Approach to Protein Threading

3
Relevance of Protein Structure in the Post-Genome
Era
(Diagram labels: sequence, structure, function, medicine)
4
A Protein Sequence
>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]
MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSS
ASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNLDSARSSFSV
ALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTN
KSSVFPSPGTPTYLHSMQKGWSSERVPLRSNGGRSPPNAGFLPLYSGRT
VPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYS
LYSPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPS
MARSVSIHGCSETLASSSQDDIHESMKDAATDAQAVSRRDMATQMSPEG
SIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWS
KKHRGLYHGNGSKM
5
Amino Acids
Side chain
Each amino acid is identified by its side chain,
which determines the properties of this amino
acid.
6
Side Chain Properties
Hydrophobic amino acids stay inside a protein,
while hydrophilic ones tend to stay on the
exterior. Oppositely charged amino acids can form
salt bridges. Polar amino acids can participate
in hydrogen bonding.
The amino acid names are colored according to
their type: positively charged, negatively
charged, polar but not charged, aliphatic
(nonpolar), and aromatic. Amino acids that are
essential to mammals are marked with an asterisk
(*).
7
Levels of Protein Structures
  • Primary sequence: amino acid (residue) sequence
  • Secondary structure: local arrangement, such as
    helix, beta sheet, loop
  • Tertiary structure: overall spatial conformation
  • Quaternary structure: spatial relationship among
    multiple chains

8
Beta Sheet Examples
Anti-parallel beta sheet
Parallel beta sheet
9
Beta Sheet Examples (Contd)
10
Helix Examples
11
Protein Structure Example
Beta Sheet
Helix
Loop
ID: 12as (2 chains)
12
What determines structures?
  • Hydrogen bonds: essential in stabilizing the
    basic secondary structures
  • Hydrophobic effects: the strongest determinants
    of protein structures
  • Van der Waals forces: stabilize the hydrophobic
    cores
  • Electrostatic forces: oppositely charged side
    chains form salt bridges

13
Domain, Motif, Fold
  • Domain: a discrete portion of a protein assumed
    to fold independently of the rest of the protein
    and possessing its own function.
  • Most proteins have multiple domains.
  • The overall shape of a domain is called a fold.
    There are only a few thousand possible folds.
  • Super-secondary structure (motif): frequently
    occurring structure patterns among multiple
    proteins, which do not necessarily have similar
    folds.

14
Protein Structure Prediction
  • Problem: given the amino acid sequence of a
    protein, what is its shape in three-dimensional
    space?
  • Subproblems: secondary structure prediction,
    residue-residue contact prediction, angle
    prediction

15
Why Prediction Needed?
  • The function of a protein is determined by its
    structure.
  • Experimental methods to determine protein
    structure are time-consuming and expensive.
  • There is a big gap between the number of
    available protein sequences and structures.

16
Growth of Protein Sequences and Structures
Data from http://www.dna.affrc.go.jp
17
Protein Classification
  • Family: proteins in the same family are
    homologous, evolved from the same ancestor.
    Usually, the identity of two sequences is very
    high. Similar structures.
  • Superfamily: distantly homologous sequences,
    evolved from the same ancestor. Sequence identity
    is around 25-30%. Similar structures.
  • Fold: only the shapes are similar, with no
    homologous relationship. Usually, sequence
    identity is very low.
  • Protein classification databases: SCOP, CATH

18
Target Sequence Category
  • Homology Modeling (HM) targets, also called
    Comparative Modeling (CM) targets
  • Easy HM: has a homologous protein with known
    structure
  • Hard HM: has a distantly homologous protein with
    known structure
  • Fold Recognition (FR) targets: a protein with
    known structure having the same fold as the
    target can be found
  • New Fold (NF) targets

19
Observations
  • Sequences determine structures.
  • Proteins fold into the minimum-energy state.
  • Structures are more conserved than sequences. If
    two protein sequences share 30% identical
    residues, they have a very good chance of
    having the same fold.
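
The identity threshold above can be made concrete with a small sketch. It assumes the two sequences are already aligned (equal length, `-` for gaps) and that identity is counted over columns where neither sequence has a gap; other definitions use a different denominator. The function name and the toy sequences are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns whose residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy alignment: 6 gap-free columns, 4 of them identical.
identity = percent_identity("MKV-LSA", "MRVQLTA")
print(f"{identity:.1f}% identical")
```

A pair scoring above roughly 30 on this measure would, by the observation above, be a good candidate for sharing a fold.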

20
Prediction Methods
  • Ab initio folding: NF targets, build a structure
    without referring to an existing structure
  • Homology Modeling: HM targets, sequence-based
    method
  • Protein Threading: FR targets, sequence-structure
    alignment
  • Consensus Method: vote a prediction from
    candidates generated by several prediction
    programs

21
Ab Initio Folding
  • Based on first principles
  • Builds structures purely from protein sequences;
    no templates are used
  • Unaffordable computing demands
  • The paradigm is changing: knowledge-based methods
    are being proposed

22
Ab Initio Energy Function
23
Lattice Model
  • Arrange all the atoms at grid points by
    Monte Carlo simulation
  • The pure lattice model only works for very small
    proteins
  • Use a simplified representation
  • Adding constraints such as partial NMR data or
    predicted residue-residue contacts can speed up
    convergence

Figure taken from Jeff Skolnick et al.
24
Segment Assembly
  • Also called mini-threading
  • Baker's method
  • Construct a library of small structure fragments,
    each of length 9.
  • Design a threading method to predict the
    structure of a protein sequence segment of length
    9.
  • Cut a target sequence into about (len-9)
    overlapping sequence segments of length 9. For
    each sequence segment, choose some candidate
    fragments from the fragment library.
  • Assemble these fragments by Monte Carlo
    simulation
  • Thousands of simulations are run. The generated
    structures are grouped into clusters. Clusters
    are ranked by their average energy.
  • D. Jones uses super-secondary structure as the
    basic construction unit.
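
The segment-cutting step can be sketched as a sliding window. This assumes the windows overlap and advance one residue at a time, which the slide does not state explicitly; under that assumption a sequence of length len yields len - 9 + 1 windows, close to the (len-9) quoted above. The function name is hypothetical.

```python
def sequence_windows(sequence: str, size: int = 9) -> list[tuple[int, str]]:
    """Return (start, segment) for every overlapping window of `size` residues."""
    return [(i, sequence[i:i + size])
            for i in range(len(sequence) - size + 1)]


# First 17 residues of the example protein shown earlier in the deck.
windows = sequence_windows("MPSESSYKVHRPAKSGG")
for start, segment in windows:
    print(start, segment)
```

Each window would then be matched against the fragment library to collect candidate 9-residue structures for that position.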

25
Homology Modeling
  • Search for homologous proteins with sequence
    search tools such as PSI-BLAST
  • Multiple sequence alignment (key step)
  • Identify cores and loops: conserved segments are
    cores, the rest are loops
  • Core modeling: copy backbone coordinates from the
    homologous protein with known structure
  • Loop modeling: search a fragment library
  • Side chain modeling: search a rotamer library
  • Refinement: tools such as WHAT IF, PROCHECK,
    and Verify3D can be used

26
Protein Threading
  • Make a structure prediction by finding an
    optimal alignment (placement) of a protein
    sequence onto each known structure (structural
    template)
  • Alignment quality is measured by a
    statistics-based scoring function
  • The best overall alignment among all templates
    may give a structure prediction

27
Threading Example
28
PDB New Fold Growth
Old fold
New fold
  • The number of unique folds in nature is fairly
    small (possibly a few thousand)
  • 90% of new structures submitted to PDB in the
    past three years have similar structural folds in
    PDB

29
Protein Threading Procedures
  • Step 1: Construction of Template Library
  • Step 2: Design of Scoring Function
  • Step 3: Sequence-Structure Alignment
  • Step 4: Template Selection and Model Construction

Only step 1 is relatively easy!
30
Template Database
  • A representative set of protein structures
    extracted from the PDB database. It satisfies the
    following conditions:
  • The resolution of each representative structure
    should be good
  • A good X-ray structure has higher priority than an
    NMR structure
  • The sequence identity between any two
    representatives should be no more than 30%, in
    order to save computing time.
  • Examples
  • CATH http://www.biochem.ucl.ac.uk/bsm/cath/
  • SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
  • PDB_SELECT http://www.cmbi.kun.nl/gv/pdbsel/
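
The selection conditions above suggest a simple greedy procedure. The sketch below is one plausible way to do it, not the method used by CATH, SCOP, or PDB_SELECT: rank structures (X-ray before NMR, then lower resolution first) and keep a structure only if it is at most 30% identical to everything already kept. The `Structure` record, the `toy_identity` function, and the sample entries are all made up for illustration.

```python
from typing import Callable, NamedTuple


class Structure(NamedTuple):
    name: str
    method: str        # "xray" or "nmr"
    resolution: float  # angstroms, lower is better (placeholder for NMR)
    sequence: str


def select_representatives(structures: list[Structure],
                           identity: Callable[[str, str], float],
                           max_identity: float = 30.0) -> list[Structure]:
    """Greedily keep the best-ranked structures that are mutually dissimilar."""
    # X-ray entries sort before NMR entries, then by resolution.
    ranked = sorted(structures, key=lambda s: (s.method != "xray", s.resolution))
    chosen: list[Structure] = []
    for s in ranked:
        if all(identity(s.sequence, c.sequence) <= max_identity for c in chosen):
            chosen.append(s)
    return chosen


def toy_identity(a: str, b: str) -> float:
    """Ungapped percent identity; a stand-in for a real alignment-based value."""
    n = min(len(a), len(b))
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n if n else 0.0


library = [
    Structure("A", "xray", 1.8, "AAAAAAAAAA"),
    Structure("B", "nmr", 0.0, "AAAAAAAAAA"),
    Structure("C", "xray", 2.5, "CCCCCCCCCC"),
    Structure("D", "xray", 1.5, "AAADDDDDDD"),
]
print([s.name for s in select_representatives(library, toy_identity)])
```

In this toy library, B duplicates A and is dropped, while D, A, and C are kept in ranking order.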

31
Scoring Function
  • E_s (fitness score): how well a residue fits a
    structural environment
  • E_p (pairwise potential): how preferable it is
    to put two particular residues nearby
  • E_m (mutation score): sequence similarity between
    query and template proteins
  • E_g (gap score): alignment gap penalty
  • E_ss: how consistent the secondary structures are

$E = E_p + E_s + E_m + E_g + E_{ss}$

Minimize E to find a sequence-template alignment
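
As a minimal sketch of how the terms combine, the code below treats E as a plain unweighted sum of the five terms and picks the candidate alignment with the lowest total. The unweighted sum is an assumption about the slide's formula (real threading programs often weight the terms), and the class name, function name, and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoreTerms:
    pairwise: float   # E_p
    fitness: float    # E_s
    mutation: float   # E_m
    gap: float        # E_g
    secondary: float  # E_ss

    def total(self) -> float:
        """Total energy E, assumed here to be the unweighted sum of all terms."""
        return (self.pairwise + self.fitness + self.mutation
                + self.gap + self.secondary)


def best_alignment(candidates: dict[str, ScoreTerms]) -> str:
    """Return the name of the candidate alignment with minimum total energy."""
    return min(candidates, key=lambda name: candidates[name].total())


candidates = {
    "alignment_1": ScoreTerms(-3.0, -2.5, -4.0, 1.5, -1.0),
    "alignment_2": ScoreTerms(-1.0, -2.0, -6.5, 0.5, -0.5),
}
print(best_alignment(candidates))
```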
32
Nonpairwise Threading Programs
  • Sequence-sequence alignment
  • Sequence-profile alignment
  • Sequence-HMM model alignment, e.g. SAMT02
    (K. Karplus et al.)
  • Profile-sequence alignment, e.g. PDB-Blast
    (A. Godzik et al.)
  • Profile-profile alignment, e.g. PROSPECT-II
    (Y. Xu et al.)
  • Combinations of several alignments, e.g. 3DPS
    (L.A. Kelley et al.), SHGU (D. Fischer)

33
Pairwise Threading Algorithms
  • Approximation algorithms: Interaction-Frozen
    Algorithm (A. Godzik et al.); Monte Carlo
    sampling (S.H. Bryant et al.); double dynamic
    programming (D. Jones et al.)
  • Exact algorithms: branch-and-bound (R.H. Lathrop
    and T.F. Smith); divide-and-conquer, used by
    PROSPECT-I (Y. Xu et al.); linear programming,
    used by RAPTOR (J. Xu et al.)

34
Fold Recognition and Model Building
  • Protein threading algorithms only generate
    sequence-template alignments
  • The correct template must still be chosen from
    the template database
  • zScore, Neural Network, SVM
  • Tools are needed to generate the 3D structure of
    the sequence from the sequence-template alignment
  • MODELLER (http://salilab.org/modeller/modeller.html)
  • MaxSprout (http://www.ebi.ac.uk/maxsprout/)
  • Jackal (http://honiglab.cpmc.columbia.edu/programs/jackal/intro.html)

35
CASP/CAFASP
  • CASP: Critical Assessment of Structure Prediction
  • CAFASP: Critical Assessment of Fully Automated
    Structure Prediction

CASP Predictor
CAFASP Predictor
  • Won't get tired
  • High-throughput

36
CASP/CAFASP (contd)
  • Public
  • Organized by the structure community
  • Evaluated by an unbiased third party
  • Held every two years
  • Blind
  • Experimental structures are determined by
    structure centers after the competition
  • Drawback: <100 targets
  • Blindness
  • Some centers are reluctant to release their
    structures

37
Threading Model
  • Each template is parsed as a chain of cores. Two
    adjacent cores are connected by a loop. Cores are
    the most conserved segments in a protein.
  • No gap allowed within a core.
  • Only pairwise contacts between two core residues
    are considered, because contacts involving loop
    residues are not well conserved.
  • Global alignment employed
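
The model above restricts an alignment to a choice of start position for each core. A minimal sketch of the resulting search space, assuming cores must stay in template order, must not overlap, and that loops between them may have any length including zero (real programs usually bound loop lengths), is the following enumerator. The function name is hypothetical, and brute-force enumeration is only practical for toy inputs.

```python
from typing import Iterator


def placements(core_lengths: list[int], seq_len: int,
               start: int = 0) -> Iterator[tuple[int, ...]]:
    """Yield every tuple of core start positions that keeps cores in order,
    non-overlapping, and gap-free (each core covers a contiguous block)."""
    if not core_lengths:
        yield ()
        return
    first, rest = core_lengths[0], core_lengths[1:]
    # Leave enough room after this core for all remaining cores.
    last_start = seq_len - sum(core_lengths)
    for s in range(start, last_start + 1):
        for tail in placements(rest, seq_len, s + first):
            yield (s,) + tail


# Two cores of lengths 2 and 3 threaded onto a sequence of length 7.
for p in placements([2, 3], 7):
    print(p)
```

The number of placements grows combinatorially with the number of cores and the sequence length, which is why the pairwise scoring version of the problem needs the algorithms listed earlier.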

38
Contact Graph
  • Each residue is a vertex
  • There is an edge between two residues if their
    spatial distance is within a given cutoff.
  • Cores are the most conserved segments in the
    template
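
A contact graph of this kind can be built directly from residue coordinates. The sketch below assumes one representative 3D point per residue and uses a cutoff of 7.0 angstroms purely as a placeholder; the slides do not give the cutoff or say which atom represents a residue.

```python
import math

Point = tuple[float, float, float]


def contact_graph(coords: list[Point], cutoff: float = 7.0,
                  min_separation: int = 1) -> set[tuple[int, int]]:
    """Return edges (i, j), i < j, for residues within `cutoff` of each other.

    `min_separation` skips pairs that are too close along the chain.
    """
    edges = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.add((i, j))
    return edges


# Four residues on a line; the last one is far from the others.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(sorted(contact_graph(toy_coords)))
```

Raising `min_separation` is a common way to ignore trivially close chain neighbours, though whether the original graph does so is not stated.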

template
39
Simplified Contact Graph
40
Contact Graph and Alignment Diagram
41
Contact Graph and Alignment Diagram
42
Calculation of Alignment Score
43
Hardness of Protein Threading
  • Protein threading is NP-hard
  • Proof: reduce the Max-Cut problem to this
    problem
  • Given a graph, number its nodes in some order.
    Assume there are M nodes.
  • Consider a sequence of length 2M of the form
    PHPH...PH
  • The pairwise score is 1 only if two residues of
    different types are mapped to the two ends of a
    graph edge, and 0 otherwise

44
Linear Integer Program
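maximize $z = 6x + 5y$ (linear function)
subject to $3x + y \le 11$, $-x + 2y \le 5$, $x, y \ge 0$ (linear constraints)
$x, y$ integer (integral constraints, nonlinear)
The objective and the linear constraints alone
form the linear program; adding the integral
constraints gives the integer program.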
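
As a check on this example, the sketch below (not from the slides) solves both versions exactly. The linear program is solved by testing every vertex of the feasible region, which works here because the region is a small bounded polygon, and the integer program by brute force over a small grid. Neither approach is how a real solver works; they only make the gap between the two optima visible.

```python
from fractions import Fraction
from itertools import combinations

# Constraints written as a*x + b*y <= c, including x >= 0 and y >= 0.
CONSTRAINTS = [(3, 1, 11), (-1, 2, 5), (-1, 0, 0), (0, -1, 0)]


def objective(x, y):
    return 6 * x + 5 * y


def feasible(x, y):
    return all(a * x + b * y <= c for a, b, c in CONSTRAINTS)


def solve_lp():
    """Exact LP optimum: best feasible intersection of two constraint lines."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel lines never meet
        # Cramer's rule for the 2x2 system.
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        if feasible(x, y) and (best is None or objective(x, y) > objective(*best)):
            best = (x, y)
    return best


def solve_ip():
    """Integer optimum by brute force over a grid that covers the region."""
    points = [(x, y) for x in range(12) for y in range(12) if feasible(x, y)]
    return max(points, key=lambda p: objective(*p))


lp = solve_lp()
ip = solve_ip()
print("LP optimum:", lp, "z =", objective(*lp))  # x = 17/7, y = 26/7, z = 232/7
print("IP optimum:", ip, "z =", objective(*ip))  # x = 3, y = 2, z = 28
```

The relaxed optimum (z = 232/7, about 33.14) is fractional and strictly better than the integer optimum (z = 28), so rounding or branching is needed for this instance.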
45
Linear Integer Program
  • Linear programs can be solved in polynomial
    time
  • No polynomial-time algorithm is known for
    integer programs
  • Relax to a linear program and solve the linear
    version
  • Branch-and-bound or branch-and-cut (may take
    exponential time)

46
Variables
  • x(i,l) = 1 denotes that core i is aligned to
    sequence position l
  • y(i,l,j,k) = 1 denotes that core i is aligned to
    position l and core j is aligned to position k at
    the same time.

47
Standard Formulation
a: singleton score parameter; b: pairwise score
parameter
Each y variable is 1 if and only if its two x
variables are 1
Each core has only one alignment position
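
The formulas on this slide did not survive as text. A sketch consistent with the two stated constraints is given below. It assumes the objective is the total score to be minimized, writes the singleton and pairwise parameters as $a_{i,l}$ and $b_{i,l,j,k}$ (this indexing is an assumption), and uses the textbook linearization of a product of two binary variables, which may differ from what the slide showed.

```latex
\begin{aligned}
\min\ & \sum_{i,l} a_{i,l}\, x_{i,l} \;+\; \sum_{i,l,j,k} b_{i,l,j,k}\, y_{i,l,j,k} \\
\text{s.t.}\ & \sum_{l} x_{i,l} = 1 && \text{for every core } i \\
& y_{i,l,j,k} \le x_{i,l}, \qquad y_{i,l,j,k} \le x_{j,k} \\
& y_{i,l,j,k} \ge x_{i,l} + x_{j,k} - 1 \\
& x_{i,l},\ y_{i,l,j,k} \in \{0, 1\}
\end{aligned}
```

The first constraint gives each core exactly one position; the remaining inequalities force $y_{i,l,j,k} = x_{i,l}\, x_{j,k}$ whenever the variables are integral.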
48
Better Formulation
a: singleton score parameter; b: pairwise score
parameter
Each y variable is 1 if and only if its two x
variables are 1
Each core has only one alignment position
  • 99% of real threading instances generate integral
    solutions directly
  • The fractional solution space of this formulation
    is a subset of that of the previous one
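
The slide's formulas are missing, so the following is only one formulation with the stated subset property, offered as an assumption. It links y to x through marginal equalities, for each pair of cores $i, j$ whose pairwise interaction is scored:

```latex
\begin{aligned}
& \sum_{k} y_{i,l,j,k} = x_{i,l} && \text{for every position } l \text{ of core } i \\
& \sum_{l} y_{i,l,j,k} = x_{j,k} && \text{for every position } k \text{ of core } j \\
& \sum_{l} x_{i,l} = 1, \qquad x_{i,l},\ y_{i,l,j,k} \ge 0
\end{aligned}
```

Any fractional point satisfying these equalities also satisfies the usual inequalities $y_{i,l,j,k} \le x_{i,l}$, $y_{i,l,j,k} \le x_{j,k}$, and $y_{i,l,j,k} \ge x_{i,l} + x_{j,k} - 1$:

```latex
\begin{aligned}
y_{i,l,j,k} &\le \sum_{k'} y_{i,l,j,k'} = x_{i,l}, \\
y_{i,l,j,k} &= x_{j,k} - \sum_{l' \ne l} y_{i,l',j,k}
  \;\ge\; x_{j,k} - \sum_{l' \ne l} x_{i,l'}
  \;=\; x_{j,k} - (1 - x_{i,l}).
\end{aligned}
```

So the relaxed feasible region of the equality form sits inside that of the inequality form, which is the sense in which the relaxation is tighter.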

49
Integrality of Real Instances
50
Correlation Coefficient between Fractional
Solutions and Templates, Sequences
edges: the number of edges in the template
contact graph.
51
Integrality Summary
  • 99% of real instances could be solved by linear
    programming directly, with no additional
    branch-and-bound needed
  • No particular template or sequence was found to
    generate more fractional solutions
  • No particular feature of templates or sequences
    was found to lead to fractional solutions
  • Explanation?

52
zScore to Fold Recognition
  • It is defined as the deviation of the alignment
    score from the expected score
  • The expected alignment score is calculated by
    random shuffling
  • Fix the sequence-template alignment positions
  • Randomly shuffle the sequence and recalculate the
    alignment score
  • Calculate the mean and variance of the alignment
    scores generated by random shuffling
  • zScore equals (mean - alignment score) /
    standard deviation, where the standard deviation
    is the square root of the variance
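
The shuffling procedure above can be sketched as follows. The `score` argument stands for a hypothetical function that rescores a sequence on the fixed alignment positions; the toy scoring function, the shuffle count, and the seed are all illustrative. This sketch divides by the standard deviation, the usual z-score convention, and returns 0.0 when the shuffled scores do not vary.

```python
import random
import statistics
from typing import Callable


def z_score(sequence: str, score: Callable[[str], float],
            shuffles: int = 200, seed: int = 0) -> float:
    """(mean shuffled score - observed score) / std of shuffled scores."""
    rng = random.Random(seed)
    observed = score(sequence)
    residues = list(sequence)
    shuffled_scores = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # same composition, random order
        shuffled_scores.append(score("".join(residues)))
    mean = statistics.fmean(shuffled_scores)
    std = statistics.pstdev(shuffled_scores)
    if std == 0:
        return 0.0  # shuffling never changes the score
    return (mean - observed) / std


def toy_score(seq: str) -> float:
    """Toy energy: -1 for each hydrophobic 'H' placed at a buried position."""
    buried_positions = (0, 1, 2, 3)
    return -sum(1 for i in buried_positions if seq[i] == "H")


print(z_score("HHHHPPPP", toy_score))
```

Because lower scores are better here, a large positive value means the real sequence fits the template much better than random sequences of the same composition.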

53
SVM to Fold Recognition
  • SVM classification to classify threading pairs
  • Features
  • Template length, sequence length, alignment
    length
  • Mutation score, fitness score, secondary
    structure score, pairwise score
  • Gap penalty

54
CAFASP3/CASP5
  • Same target set: 62 targets
  • Time for each target
  • Individual servers: 48 hours
  • Meta servers: 96 hours
  • CASP5 predictors: May to September of 2002
  • Resources for predictors
  • No X-ray, NMR machines (of course)
  • CAFASP3 predictors: no manual intervention
  • CASP5 predictors: anything (servers, Google, ...)
  • Evaluation
  • CASP5: assessed by experts and computer
  • CAFASP3: evaluated by MaxSub, a computer program.
    Predicted structures are superimposed on the
    experimental structures to see how large a
    portion is superimposable.
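
To illustrate what "superimposable" means, the sketch below counts residue pairs that lie within a distance cutoff and reports their RMSD. It is a simplification, not MaxSub itself: it assumes the two structures are already superimposed and have one coordinate per residue in matching order, whereas MaxSub also searches for the superposition. The 5.0 cutoff and the toy coordinates are placeholders.

```python
import math

Point = tuple[float, float, float]


def superimposable(predicted: list[Point], experimental: list[Point],
                   cutoff: float = 5.0) -> tuple[int, float]:
    """Return (count, rmsd) over residue pairs closer than `cutoff`."""
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    close = [d for d in distances if d <= cutoff]
    if not close:
        return 0, 0.0
    rmsd = math.sqrt(sum(d * d for d in close) / len(close))
    return len(close), rmsd


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
expt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0), (0.0, 0.0, 0.0)]
count, rmsd = superimposable(pred, expt)
print(count, round(rmsd, 3))
```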

55
CASP5/CAFASP3 targets
Hard
Easy
Prediction Difficulty
CM: Comparative Modelling, HM: Homology
Modelling, FR: Fold Recognition, NF: New Fold
56
State of the Art
Servers with names in italics are meta servers
(http://www.cs.bgu.ac.il/~dfischer/CAFASP3,
released in December 2002.)
57
RAPTOR's sensitivity on FR targets
1. RAPTOR is weak at recognizing FR(A) targets
(needs improvement)
2. RAPTOR could not deal with NF targets at all
(normal)
58
CAFASP3 Example
  • Target size: 144
  • Superimposable size within 5 Å: 118
  • RMSD: 1.9

Blue: Correct Prediction
Red: Experimental Structure
Green: Incorrect Prediction
59
Term Project Options
  • Design a machine learning approach to fold
    recognition
  • Design a template database from scratch, based on
    only the PDB database, and compare it to other
    databases
  • Protein secondary structure prediction
  • Local alignment for protein threading
  • Critical review of three protein threading
    algorithms: branch-and-bound, divide-and-conquer,
    and linear programming

60
Open Questions
  • A practical and exact algorithm for the protein
    threading problem allowing gaps within cores
  • A practical and exact algorithm for the ab initio
    protein folding problem
  • Investigate the conditions under which the linear
    program generates integral solutions directly

61
Acknowledgements
  • Bonnie Berger, Introduction to Computational
    Molecular Biology, course notes, 2001
  • Bin Ma, Bioinformatics, course notes, 2004

62
Reading List
  • CASP1, CASP2, CASP3, CASP4, and CASP5 Special
    Issues, Proteins: Structure, Function, and
    Genetics, 1995, 1997, 1999, 2001, 2003