CS882: Protein Structure Prediction presentation

1
CS882 Protein Structure Prediction

Jinbo Xu
School of Computer Science, University of Waterloo

2
Outline

Protein Structure Basics
Protein Structure Prediction
Prediction Assessment
Linear Program Approach to Protein Threading

3
Relevance of Protein Structure in the Post-Genome Era
[Diagram labels: sequence, structure, function, medicine]
4
A Protein Sequence
>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]
MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSS
ASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNLDSARSSFSV
ALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTN
KSSVFPSPGTPTYLHSMQKGWSSERVPLRSNGGRSPPNAGFLPLYSGRT
VPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYS
LYSPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPS
MARSVSIHGCSETLASSSQDDIHESMKDAATDAQAVSRRDMATQMSPEG
SIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWS
KKHRGLYHGNGSKM
5
Amino Acids
Side chain
Each amino acid is identified by its side chain, which determines its properties.
6
Side Chain Properties
Hydrophobic amino acids tend to stay in the interior of a protein, while hydrophilic ones tend to stay on the exterior.
Oppositely charged amino acids can form salt bridges.
Polar amino acids can participate in hydrogen bonding.
The amino acid names are colored according to their type: positively charged, negatively charged, polar but not charged, aliphatic (nonpolar), and aromatic. Amino acids that are essential to mammals are marked with an asterisk (*).
7
Levels of Protein Structures

Primary sequence
Amino acid (residue) sequence
Secondary structure
local arrangement, such as helix, beta sheet,
loop
Tertiary structure
Overall spatial conformation
Quaternary structure
Spatial relationship among multiple chains

8
Beta Sheet Examples
Anti-parallel beta sheet
Parallel beta sheet
9
Beta Sheet Examples (Cont'd)
10
Helix Examples
11
Protein Structure Example
Beta Sheet
Helix
Loop
PDB ID: 12as (2 chains)
12
What determines structures?

Hydrogen bonds: essential in stabilizing the basic secondary structures
Hydrophobic effects: the strongest determinants of protein structures
Van der Waals forces: stabilize the hydrophobic cores
Electrostatic forces: oppositely charged side chains form salt bridges

13
Domain, Motif, Fold

Domain: a discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
Most proteins have multiple domains.
The overall shape of a domain is called a fold. There are only a few thousand possible folds.
Super-secondary structure (motif): frequently occurring structural patterns among multiple proteins, which do not necessarily have similar folds.

14
Protein Structure Prediction

Problem
Given the amino acid sequence of a protein, what is its shape in three-dimensional space?
Subproblems
Secondary structure prediction
Residue-residue contact prediction
Angle prediction

15
Why Is Prediction Needed?

The function of a protein is determined by its structure.
Experimental methods to determine protein structure are time-consuming and expensive.
There is a large gap between the number of available protein sequences and the number of available structures.

16
Growth of Protein Sequences and Structures
Data from http://www.dna.affrc.go.jp
17
Protein Classification

Family: proteins in the same family are homologous, having evolved from the same ancestor. Sequence identity between members is usually very high, and their structures are similar.
Superfamily: distantly homologous sequences that evolved from the same ancestor. Sequence identity is around 25-30%. Structures are similar.
Fold: only the shapes are similar, with no homologous relationship. Sequence identity is usually very low.
Protein classification databases: SCOP, CATH

18
Target Sequence Category

Homology Modeling (HM) targets
Easy HM: has a homologous protein with known structure
Hard HM: has a distantly homologous protein with known structure
Also called Comparative Modeling (CM) targets
Fold Recognition (FR) targets
A protein with known structure and the same fold as the target can be found
New Fold (NF) targets

19
Observations

Sequences determine structures.
Proteins fold into a minimum-energy state.
Structures are more conserved than sequences. If two protein sequences share 30% identical residues, then they have a very good chance of having the same fold.
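The sequence-identity threshold above can be made concrete with a small sketch. The function name `percent_identity`, the gap character `-`, and the toy sequences are illustrative assumptions, and identity is computed here over aligned non-gap columns, which is only one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned, gap-free columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy alignment (made-up sequences, not real proteins).
identity = percent_identity("MPSE-SYKVH", "MPTESSYRVH")
print(f"{identity:.1f}% identity")
```

Other definitions divide by the length of the shorter sequence or by the full alignment length, so reported identities are only comparable when the convention is stated.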

20
Prediction Methods

Ab initio folding: for NF targets; builds a structure without referring to an existing structure
Homology Modeling: for HM targets; a sequence-based method
Protein Threading: for FR targets; uses sequence-structure alignment
Consensus Method: votes for a prediction among candidates generated by several prediction programs

21
Ab Initio Folding

Based on first principles
Builds structures purely from protein sequences; no templates are used
Computing demands are unaffordable
The paradigm is changing: knowledge-based methods are being proposed

22
Ab Initio Energy Function
23
Lattice Model

Arrange all the atoms at grid points by Monte Carlo simulation
A pure lattice model only works for very small proteins
Use a simplified representation
Adding constraints, such as partial NMR data or predicted residue-residue contacts, can speed up convergence

Taken from Jeff Skolnick et al.
24
Segment Assembly

Also called Mini-Threading
Baker's method
Construct a library of small structure fragments, each of length 9.
Design a threading method to predict the structure of a protein sequence segment of length 9.
Cut a target sequence of length len into (len - 9 + 1) overlapping sequence segments of length 9. For each sequence segment, choose some candidate fragments from the fragment library.
Assemble these fragments by Monte Carlo simulation.
Thousands of simulations are done. The generated structures are grouped into clusters. Clusters are ranked by their average energy.
D. Jones uses super-secondary structures as the basic construction unit.
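The segment-cutting step can be sketched as a sliding window. This is only an illustration of the windowing, not of Baker's method itself; the function name `sliding_segments` is hypothetical, and a sequence of length n yields n - 9 + 1 windows of length 9.

```python
def sliding_segments(sequence: str, length: int = 9) -> list[tuple[int, str]]:
    """Return (start index, segment) for every window of the given length."""
    return [(i, sequence[i:i + length]) for i in range(len(sequence) - length + 1)]


# First 20 residues of the example sequence shown earlier in the slides.
target = "MPSESSYKVHRPAKSGGSRR"
segments = sliding_segments(target)
print(len(segments))   # number of 9-residue windows
print(segments[0])     # first window with its start index
```

Each window would then be matched against the fragment library to pick candidate fragments; that lookup is not shown because the library format is not described in the slides.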

25
Homology Modeling

Search for homologous proteins with sequence search tools such as PSI-BLAST
Multiple sequence alignment (key step)
Identify cores and loops: conserved segments are cores; the rest are loops
Core modeling: copy backbone coordinates from the homologous protein with known structure
Loop modeling: search a fragment library
Side chain modeling: search a rotamer library
Refinement: tools such as WHAT IF, PROCHECK, and Verify3D can be used

26
Protein Threading

Make a structure prediction by finding an optimal alignment (placement) of a protein sequence onto each known structure (structural template)
Alignment quality is measured by a statistics-based scoring function
The best overall alignment among all templates may give a structure prediction

27
Threading Example
28
PDB New Fold Growth
Old fold
New fold

The number of unique folds in nature is fairly small (possibly a few thousand)
90% of new structures submitted to PDB in the past three years have folds similar to structures already in PDB

29
Protein Threading Procedures

Step 1: Construction of Template Library
Step 2: Design of Scoring Function
Step 3: Sequence-Structure Alignment
Step 4: Template Selection and Model Construction

Only step 1 is relatively easy!
30
Template Database

A representative set of protein structures extracted from the PDB database. It satisfies the following conditions:
The resolution of each representative structure should be good
A good X-ray structure has higher priority than an NMR structure
The sequence identity between any two representatives should be no more than 30%, in order to save computing time.

Examples
CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
PDB_SELECT: http://www.cmbi.kun.nl/gv/pdbsel/

31
Scoring Function
How well a residue fits a structural environment: E_s (fitness score)
How preferable it is to put two particular residues nearby: E_p (pairwise potential)
Sequence similarity between query and template proteins: E_m (mutation score)
Alignment gap penalty: E_g (gap score)
How consistent the secondary structures are: E_ss
E = E_p + E_s + E_m + E_g + E_ss
Minimize E to find a sequence-template alignment
32
Nonpairwise Threading Programs

Sequence-sequence alignment
Sequence-profile alignment
Sequence-HMM model alignment
e.g. SAMT02 (K. Karplus et al.)
Profile-sequence alignment
e.g. PDB-Blast (A. Godzik et al.)
Profile-profile alignment
e.g. PROSPECT-II (Y. Xu et al.)
Combinations of several alignments
e.g. 3DPS (L.A. Kelley et al.), SHGU (D. Fischer)

33
Pairwise Threading Algorithms

Approximation Algorithm
Interaction-Frozen Algorithm (A. Godzik et al.)
Monte Carlo Sampling (S.H. Bryant et al.)
Double dynamic programming (D. Jones et al.)
Exact Algorithm
Branch-and-bound (R.H. Lathrop and T.F. Smith)
PROSPECT-I uses Divide-and-conquer (Y. Xu et al.)
Linear programming by RAPTOR (J. Xu et al.)

34
Fold Recognition and Model Building

Protein Threading algorithms can only generate
sequence-template alignments
Correct templates should be chosen from the
template database
zScore, Neural Network, SVM
Tools are needed to generate the 3D structure of the sequence from the sequence-template alignment
MODELLER (http://salilab.org/modeller/modeller.html)
MaxSprout (http://www.ebi.ac.uk/maxsprout/)
Jackal (http://honiglab.cpmc.columbia.edu/programs/jackal/intro.html)

35
CASP/CAFASP

CASP: Critical Assessment of Structure Prediction
CAFASP: Critical Assessment of Fully Automated Structure Prediction

CASP Predictor
CAFASP Predictor

Won't get tired
High-throughput

36
CASP/CAFASP (cont'd)

Public
Organized by the structure community
Evaluated by an unbiased third party
Held every two years
Blind
Experimental structures are to be determined by structure centers after the competition
Drawback: <100 targets
Blindness
Some centers are reluctant to release their structures

37
Threading Model

Each template is parsed as a chain of cores. Two
adjacent cores are connected by a loop. Cores are
the most conserved segments in a protein.
No gap allowed within a core.
Only pairwise contacts between two core residues are considered, because contacts involving loop residues are not well conserved.
Global alignment is employed
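To make the model concrete, the sketch below enumerates every way to place a chain of cores on a sequence. It assumes cores keep their template order, do not overlap, contain no gaps, and may be separated by loops of any length including zero; the function name `core_placements` and the toy sizes are illustrative. Brute-force enumeration is only practical for tiny inputs.

```python
def core_placements(core_lengths, seq_length):
    """Yield every tuple of core start positions that keeps the cores in order,
    non-overlapping, ungapped, and entirely inside the sequence."""
    def extend(index, earliest, starts):
        if index == len(core_lengths):
            yield tuple(starts)
            return
        # Leave enough room for this core and all cores that follow it.
        remaining = sum(core_lengths[index:])
        for start in range(earliest, seq_length - remaining + 1):
            next_earliest = start + core_lengths[index]
            yield from extend(index + 1, next_earliest, starts + [start])

    yield from extend(0, 0, [])


# Two cores of lengths 2 and 3 threaded onto a sequence of length 6.
placements = list(core_placements([2, 3], 6))
print(placements)
```

The number of placements grows rapidly with the number of cores and the sequence length, which is why an exhaustive search like this does not scale once pairwise contact scores between cores have to be optimized.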

38
Contact Graph

Each residue is a vertex
An edge connects two residues if their spatial distance is within a given cutoff.
Cores are the most conserved segments in the
template
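A minimal sketch of building such a contact graph from coordinates. The cutoff of 7.0 angstroms, the use of one representative point per residue, and the made-up coordinates are assumptions for illustration; the slides do not state which atoms or cutoff are used.

```python
import math


def contact_graph(coords, cutoff=7.0):
    """Return edges (i, j) with i < j for residues whose distance is within the cutoff."""
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
    return edges


# Made-up coordinates in angstroms, one representative point per residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 5.0, 0.0)]
edges = contact_graph(coords)
print(edges)
```

In a threading setting, edges would typically be kept only between residues that belong to cores, in line with the threading model's restriction to core-core contacts.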

39
Simplified Contact Graph
40
Contact Graph and Alignment Diagram
41
Contact Graph and Alignment Diagram
42
Calculation of Alignment Score
43
Hardness of Protein Threading

Protein threading is NP-hard
Proof
Reduce the Max-Cut problem to this problem
Given a graph, number its nodes in a certain order. Assume there are M nodes.
Consider a sequence of length 2M that alternates P and H: PHPHPH...
The pairwise score is 1 only if two residues of different types are mapped to the two ends of a graph edge; otherwise it is 0

44
Linear Integer Program
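maximize
z = 6x + 5y (linear function)
subject to
3x + y ≤ 11, -x + 2y ≤ 5, x, y ≥ 0 (linear constraints)
x, y integer (integral constraints, nonlinear)
Linear program: the objective and the linear constraints only
Integer program: the linear program plus the integral constraints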
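The small example can be checked by hand or with a short script. The sketch below reads the slide's problem as: maximize 6x + 5y subject to 3x + y ≤ 11, -x + 2y ≤ 5, x, y ≥ 0. It compares the relaxed linear program with the integer program by brute force; this is a toy check for a two-variable problem, not how a real solver works, and all function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Constraints written as a*x + b*y <= c; the last two encode x >= 0 and y >= 0.
CONSTRAINTS = [(3, 1, 11), (-1, 2, 5), (-1, 0, 0), (0, -1, 0)]


def objective(x, y):
    return 6 * x + 5 * y


def feasible(x, y):
    return all(a * x + b * y <= c for a, b, c in CONSTRAINTS)


def lp_optimum():
    """Best vertex of the relaxed problem, found by intersecting constraint pairs."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundary lines never meet
        # Cramer's rule for the intersection of the two boundary lines.
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        if feasible(x, y) and (best is None or objective(x, y) > best[0]):
            best = (objective(x, y), x, y)
    return best


def integer_optimum(limit=12):
    """Best integer point, found by enumerating a small box of candidates."""
    points = [(x, y) for x in range(limit) for y in range(limit) if feasible(x, y)]
    x, y = max(points, key=lambda p: objective(*p))
    return objective(x, y), x, y


print(lp_optimum())       # (objective, x, y) of the relaxation, as exact fractions
print(integer_optimum())  # (objective, x, y) of the integer program
```

The relaxation peaks at the fractional point x = 17/7, y = 26/7 with z = 232/7, while the best integer point is x = 3, y = 2 with z = 28. Rounding the fractional solution gives x = 2, y = 4, which violates -x + 2y ≤ 5, so the integer optimum cannot be obtained simply by rounding the relaxed solution.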
45
Linear Integer Program

Linear programs can be solved in polynomial time
No polynomial-time algorithm is known for integer programs so far
Relax the integer program to a linear program and solve the linear version
Then use branch-and-bound or branch-and-cut (may take exponential time)

46
Variables

x(i,l) denotes that core i is aligned to sequence position l
y(i,l,j,k) denotes that core i is aligned to
position l and core j is aligned to position k at
the same time.

47
Standard Formulation
a: singleton score parameter; b: pairwise score parameter
Each y variable is 1 if and only if its two x variables are 1
Each core has only one alignment position
48
Better Formulation
a: singleton score parameter; b: pairwise score parameter
Each y variable is 1 if and only if its two x variables are 1
Each core has only one alignment position

99% of real threading instances generate integral solutions directly
The fractional solution space of this formulation is a subset of that of the previous one
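The formulas on these two slides did not survive in the transcript. The block below is a sketch of how the two formulations could be written with the x and y variables defined earlier. The notation is an assumption: a_{i,l} and b_{(i,l)(j,k)} stand for the singleton and pairwise score parameters, E is the set of contact-graph edges between cores, minimization follows the scoring-function slide, and sums run over valid alignment positions only. It is one common linearization, not necessarily the exact constraints the author showed.

```latex
\begin{align*}
\min\ & \sum_{i}\sum_{l} a_{i,l}\,x_{i,l}
      + \sum_{(i,j)\in E}\sum_{l,k} b_{(i,l)(j,k)}\,y_{(i,l)(j,k)} \\
\text{s.t.}\ & \sum_{l} x_{i,l} = 1 \quad \text{for every core } i \\
\text{(standard)}\ & y_{(i,l)(j,k)} \le x_{i,l},\quad
                     y_{(i,l)(j,k)} \le x_{j,k},\quad
                     y_{(i,l)(j,k)} \ge x_{i,l} + x_{j,k} - 1 \\
\text{(tighter)}\ & \sum_{k} y_{(i,l)(j,k)} = x_{i,l},\quad
                    \sum_{l} y_{(i,l)(j,k)} = x_{j,k}
                    \quad \text{for every edge } (i,j)\in E \\
& x_{i,l},\ y_{(i,l)(j,k)} \in \{0,1\}
\end{align*}
```

Under these assumptions the tighter constraints imply the standard ones even for fractional values: with y ≥ 0, x_{i,l} - y_{(i,l)(j,k)} equals the sum of y_{(i,l)(j,k')} over k' ≠ k, which is at most the sum of x_{j,k'} over k' ≠ k, that is, 1 - x_{j,k}. This is consistent with the statement that the fractional solution space of the better formulation is a subset of that of the standard one.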

49
Integrality of Real Instances
50
Correlation Coefficient between Fractional
Solutions and Templates, Sequences
edges: the number of edges in the template contact graph.
51
Integrality Summary

99% of real instances could be solved by linear programming directly; no additional branch-and-bound was needed
No particular template or sequence was found to generate more fractional solutions
No particular feature of templates or sequences was found to lead to fractional solutions
Explanation?

52
zScore for Fold Recognition

It is defined as the deviation of the alignment score from the expected alignment score
The expected alignment score is calculated by random shuffling:
Fix the sequence-template alignment positions
Randomly shuffle the sequence and recalculate the alignment score
Calculate the mean and variance of the alignment scores generated by random shuffling
zScore equals (mean - alignment score) / standard deviation, where the standard deviation is the square root of the variance
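The shuffling procedure can be sketched as follows. The score function here is a hypothetical stand-in (a mismatch count against a made-up preference string, lower is better), not a real threading score, and the sketch divides by the standard deviation of the shuffled scores, as is conventional for a z-score. The names `z_score`, `toy_score`, and `TEMPLATE_PREFERENCE` are illustrative.

```python
import random
import statistics


def z_score(sequence, score_fn, n_shuffles=200, seed=0):
    """(mean shuffled score - actual score) / standard deviation of shuffled scores."""
    rng = random.Random(seed)
    actual = score_fn(sequence)
    residues = list(sequence)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # permute residues; alignment positions stay fixed
        shuffled_scores.append(score_fn("".join(residues)))
    mean = statistics.mean(shuffled_scores)
    spread = statistics.pstdev(shuffled_scores)
    if spread == 0:
        return 0.0  # all shuffles scored the same, so no meaningful z-score
    return (mean - actual) / spread


# Hypothetical toy score: number of positions that disagree with a made-up
# template preference string (lower is better, like an energy).
TEMPLATE_PREFERENCE = "HHHHPPPPHHHH"


def toy_score(sequence):
    return sum(1 for s, t in zip(sequence, TEMPLATE_PREFERENCE) if s != t)


z = z_score("HHHHPPPPHHHH", toy_score)
print(round(z, 2))
```

Because lower scores are better in this convention, a large positive value means the real alignment scores much better than shuffled sequences placed at the same positions.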

53
SVM for Fold Recognition

Use SVM classification to classify threading pairs
Features
Template length, sequence length, alignment
length
Mutation score, fitness score, secondary
structure score, pairwise score
Gap penalty

54
CAFASP3/CASP5

Same target set: 62 targets
Time for each target
Individual servers: 48 hours
Meta servers: 96 hours
CASP5 predictors: May to September of 2002
Resources for predictors
No X-ray, NMR machines (of course)
CAFASP3 predictors: no manual intervention
CASP5 predictors: anything (servers, Google, ...)
Evaluation
CASP5: assessed by experts + computer
CAFASP3: evaluated by MaxSub, a computer program. Predicted structures are superimposed onto the experimental structures to see how large a portion is superimposable.
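The kind of quantity such an evaluation reports can be sketched with a simple calculation. This is not the MaxSub algorithm, which searches for the best superimposable subset itself; the sketch assumes the two structures are already optimally superimposed and list the same residues in the same order. The function name `superposition_summary`, the 5.0 angstrom default cutoff, and the coordinates are illustrative.

```python
import math


def superposition_summary(predicted, experimental, cutoff=5.0):
    """Return (number of residues within the cutoff, RMSD over those residues).

    Assumes both coordinate lists are already superimposed and aligned
    residue by residue.
    """
    if len(predicted) != len(experimental):
        raise ValueError("structures must list the same residues")
    distances = [math.dist(p, q) for p, q in zip(predicted, experimental)]
    close = [d for d in distances if d <= cutoff]
    if not close:
        return 0, float("nan")
    rmsd = math.sqrt(sum(d * d for d in close) / len(close))
    return len(close), rmsd


# Made-up coordinates in angstroms for three residues.
predicted = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 1.0), (3.0, 0.0, -1.0), (6.0, 8.0, 0.0)]
count, rmsd = superposition_summary(predicted, experimental)
print(count, rmsd)
```

A full evaluation would first compute the optimal rigid-body superposition, which is omitted here to keep the sketch free of external libraries.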

55
CASP5/CAFASP3 targets
Hard
Easy
Prediction Difficulty
CM: Comparative Modelling, HM: Homology Modelling, FR: Fold Recognition, NF: New Fold
56
State of the Art
Servers with names in italics are meta servers (http://www.cs.bgu.ac.il/~dfischer/CAFASP3, released in December 2002)
57
RAPTOR's sensitivity on FR targets
1. RAPTOR is weak at recognizing FR(A) targets (needs improvement)
2. RAPTOR could not deal with NF targets at all (normal)
58
CAFASP3 Example

Target size: 144
Superimposable size within 5 Å: 118
RMSD: 1.9

Blue: correct prediction
Red: experimental structure
Green: incorrect prediction
59
Term Project Options

Design a machine learning approach to fold
recognition
Design a template database from scratch, based on
only the PDB database, and compare it to other
databases
Protein secondary structure prediction
Local alignment for protein threading
Critical review of three protein threading algorithms: branch-and-bound, divide-and-conquer, and linear programming

60
Open Questions

A practical and exact algorithm for the protein threading problem that allows gaps within cores
A practical and exact algorithm for the ab initio protein folding problem
Investigate the conditions under which the linear program will generate integral solutions directly

61
Acknowledgements

Bonnie Berger, Introduction to Computational
Molecular Biology, course notes, 2001
Bin Ma, Bioinformatics, course notes, 2004

62
Reading List

CASP1, CASP2, CASP3, CASP4, and CASP5 Special Issues, Proteins: Structure, Function, and Genetics, 1995, 1997, 1999, 2001, 2003
