Predicting and Classifying Protein Structures - PowerPoint PPT Presentation

1
Predicting and Classifying Protein Structures

Michel Dumontier, Ph.D.
Carleton University
michel_at_bioinfocg.com

2
Outline

3D Structure Determination
Validation
Structure Classification
Structure Prediction
Secondary Structure

3
Structure Validation

A structure can (and often does) have mistakes
A poor structure will lead to poor models of
mechanism or relationship
Unusual parts of a structure may indicate
something important (or an error)

4
Famous bad structures

Azotobacter ferredoxin (wrong space group)
Zn-metallothionein (mistraced chain)
Alpha bungarotoxin (poor stereochemistry)
Yeast enolase (mistraced chain)
Ras P21 oncogene (mistraced chain)
Gene V protein (poor stereochemistry)

5
Structure Validation

Assess experimental fit
look at Resolution, R-Factor or RMSD
Assess correctness of overall fold
look at disposition of hydrophobic residues
Assess structure quality
packing
stereochemistry
contacts...

6
X-Ray Resolution

Resolution (Å): Meaning
> 4.0: Coordinates meaningless.
3.0 - 4.0: Fold possibly correct, but errors are very likely. Many sidechains placed with wrong rotamer.
2.5 - 3.0: Fold likely correct except that some surface loops might be mis-modelled. Several long, thin sidechains (lys, glu, gln, etc) and small sidechains (ser, val, thr, etc) likely to have wrong rotamers.
2.0 - 2.5: As 2.5 - 3.0, but number of sidechains in wrong rotamer is considerably less. Many small errors can normally be detected. Fold normally correct and number of errors in surface loops is small.
1.5 - 2.0: Few residues have wrong rotamer. Many small errors can normally be detected. Fold always correct, also in surface loops.
0.5 - 1.5: Threonines may have wrong chirality on the C-beta.

7
A Good Protein Structure..
X-ray structure (R-factor)
R = 0.59 random chain
R = 0.45 initial structure
R = 0.35 getting there
R = 0.25 typical protein
R = 0.15 best case
R = 0.05 small molecule

NMR structure (RMSD)
RMSD = 4 Å random
RMSD = 2 Å initial fit
RMSD = 1.5 Å OK
RMSD = 0.8 Å typical
RMSD = 0.4 Å best case
RMSD = 0.2 Å dream on
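
For reference, the R-factor quoted above compares observed and calculated structure-factor amplitudes: R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|). The sketch below is a minimal illustration of that ratio only. The function name and the amplitude values are made up for the example, and the scale factor normally applied to Fcalc is assumed to be 1.

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R-factor: sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|)."""
    if len(f_obs) != len(f_calc):
        raise ValueError("amplitude lists must have the same length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


# Made-up amplitudes, for illustration only (not real diffraction data).
observed = [100.0, 80.0, 60.0, 40.0]
calculated = [90.0, 85.0, 50.0, 45.0]
print(round(r_factor(observed, calculated), 3))
```

A perfect model gives R = 0; larger values mean the model explains the measured amplitudes less well.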

8
A Good Protein Structure..

Minimizes disallowed torsion angles
Maximizes number of hydrogen bonds
Maximizes buried hydrophobic ASA
Maximizes exposed hydrophilic ASA
Minimizes interstitial cavities or spaces

9
A Good Protein Structure..

Minimizes number of bad contacts
Minimizes number of buried charges
Minimizes radius of gyration
Minimizes covalent and noncovalent (van der Waals
and coulombic) energies
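
As one concrete example of these criteria, the radius of gyration measures how compact a set of atoms is: it is the root-mean-square distance of the atoms from their centroid. The sketch below assumes equal weight for every atom (no mass weighting) and uses a hypothetical function name and toy coordinates.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    # Centroid of the point set.
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    # Mean squared distance from the centroid.
    mean_sq = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    ) / n
    return math.sqrt(mean_sq)


# Toy coordinates: four points at unit distance from the origin.
points = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
print(radius_of_gyration(points))
```

For chains of the same length, a smaller value indicates a more compact fold.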

10
Structure Validation Servers

WHAT IF
http://swift.cmbi.kun.nl/WIWWWI/
Verify3D
http://www.doe-mbi.ucla.edu/Services/Verify_3D/
VADAR
http://redpoll.pharmacy.ualberta.ca

13
Structure Validation Programs

PROCHECK
http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html
VADAR
http://www.pence.ca/software/vadar/latest/vadar.html
DSSP
http://www.cmbi.kun.nl/gv/dssp/

14
Procheck
15
Outline

3D Structure Determination
Validation
Structure Classification
Structure Prediction
Secondary Structure

16
Domains are ubiquitous in proteins
Large proteins are composed of compact,
semi-independent units - domains.
Reason: modularity and folding efficiency
2MCP.PDB
17
Protein Domains: an alphabet of functional
modules
18
SCOP

The SCOP database aims to provide a detailed and
comprehensive description of the structural and
evolutionary relationships between all proteins
whose structure is known.
Created by manual inspection and aided by
automated methods
Consists of four hierarchical categories:
Class, Fold, Superfamily and Family.
http://scop.mrc-lmb.cam.ac.uk/scop

19
structural classification
The eight most frequent SCOP superfolds
20
Semi-automated consensus domain definition -
Structure (CATH)
Dihydrolipoamide dehydrogenase 1LPFA
http://www.biochem.ucl.ac.uk/bsm/cath/
Jones S et al. (1998) Domain assignment for
protein structures using a consensus approach:
characterization and analysis. Protein Science
7:233-242
21
CATH - Class
Class 1: Mainly Alpha
Class 2: Mainly Beta
Class 3: Mixed Alpha/Beta
Class 4: Few Secondary Structures

Secondary structure content (automatic)
22
CATH - Architecture

Roll

Super Roll
Barrel
2-Layer Sandwich
Orientation of secondary structures (manual)
23
CATH - Topology

L-fucose Isomerase

Serine Protease
Aconitase, domain 4
TIM Barrel
Topological connection and number of secondary
structures
24
CATH - Homology

Alanine racemase

Dihydropteroate (DHP) synthetase
FMN dependent fluorescent proteins
7-stranded glycosidases
Superfamily: clusters of similar structures and
functions
25
Conserved Domain Database
Automated (objective) domain definition using
sequence.
CDD is built from SMART and Pfam; CDART is built
from CDD and GenBank
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
26
Homologous domains have similar structures
1PLS/2DYN: 23% sequence identity
1PLS - PH domain (Human pleckstrin)
2DYN - PH domain (Human dynamin)
27
Homology and Structural Similarity
Proteins that diverge in evolution maintain their
global fold!
Russell et al. (1997) J Mol Biol 269:423-439
28
Superposition

Important as a means to identify protein motifs
and fold families
Non-evolutionary structural relationships

Structural similarity between Calmodulin and
Acetylcholinesterase
29
RMSD metric
To calculate the RMSD, a pairwise correspondence
of points has to be defined first.
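
With the correspondence fixed (point i in one structure pairs with point i in the other), the coordinate RMSD is the square root of the mean squared distance between paired points. A minimal sketch, with a hypothetical function name and toy coordinates, and with no superposition applied:

```python
import math


def rmsd(a, b):
    """Coordinate RMSD between two equally long lists of (x, y, z) points.

    Point a[i] is assumed to correspond to point b[i].
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(a, b)
    )
    return math.sqrt(total / len(a))


# Toy example: the second set is the first shifted by 1 along z.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(set_a, set_b))
```

Because no fitting is done here, the value depends on how the two structures happen to be placed in space.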
30
RMSDopt
RMSDopt = min(RMSDcoord)
RMSDopt = RMSDcoord(A, Rs x (B - Ts))
The translation vector Ts and the rotation matrix
Rs define a superposition of the vector set B on
A.
An analytic solution of the superposition problem
is available, but not straightforward (involves
an eigenvalue problem).
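
One common way to obtain the optimal rotation is the Kabsch procedure, sketched below with NumPy. Note that this sketch uses a singular value decomposition rather than the eigenvalue formulation mentioned on the slide; both lead to the same optimal rotation. The function name is hypothetical, and the code centers both point sets and moves B onto the centroid of A, which is equivalent to choosing a suitable translation Ts.

```python
import numpy as np


def superpose(a, b):
    """Best rigid-body fit of point set b onto a (rows are paired points).

    Returns (rmsd_opt, rotation, b_fitted).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_center = a.mean(axis=0)
    b_center = b.mean(axis=0)
    a0 = a - a_center
    b0 = b - b_center
    # 3x3 covariance matrix between the centered point sets.
    h = b0.T @ a0
    u, _, vt = np.linalg.svd(h)
    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    # Rotate the centered b, then move it onto the centroid of a.
    b_fitted = (rotation @ b0.T).T + a_center
    rmsd_opt = float(np.sqrt(((a - b_fitted) ** 2).sum() / len(a)))
    return rmsd_opt, rotation, b_fitted


# Toy check: b is a rotated and shifted copy of a, so the fit should be near-perfect.
a = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
theta = 0.7
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
b = (rz @ a.T).T + np.array([5.0, -2.0, 1.0])
rmsd_opt, rotation, b_fitted = superpose(a, b)
print(rmsd_opt < 1e-9)
```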
31
Superposition in practice

Pre-aligned structures
VAST www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
FSSP www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
Homstrad www-cryst.bioc.cam.ac.uk/homstrad/
PDBsum www.biochem.ucl.ac.uk/bsm/pdbsum/
DALI www.ebi.ac.uk/dali/
On the fly
CE cl.sdsc.edu/ce.html
FAST biowulf.bu.edu/FAST/

32
Outline

3D Structure Determination
Validation
Structure Classification
Structure Prediction
Secondary Structure

33
Secondary (2°) Structure
34
Secondary Structure Prediction

One of the first fields to emerge in
bioinformatics (1967)
Grew from a simple observation that certain amino
acids or combinations of amino acids seemed to
prefer to be in certain secondary structures
Subject of hundreds of papers and dozens of
books, many methods

35
2° Structure Prediction

Statistical (Chou-Fasman, GOR)
Homology or Nearest Neighbor (Levin)
Physico-Chemical (Lim, Eisenberg)
Pattern Matching (Cohen, Rooman)
Neural Nets (Qian & Sejnowski, Karplus)
Evolutionary Methods (Barton, Niemann)
Combined Approaches (Rost, Levin, Argos)

36
Secondary Structure Prediction
37
Chou-Fasman Statistics
38
Simplified C-F Algorithm

Select a window of 7 residues
Calculate average Pa over this window and assign
that value to the central residue
Repeat the calculation for Pb and Pc
Slide the window down one residue and repeat
until sequence is complete
Analyze the resulting plot and assign each residue
the secondary structure (H, B, C) with the highest
value.
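
The steps above can be sketched in a few lines of Python. This is only an illustration of the windowed-average idea: the propensity numbers are placeholders chosen for the example, NOT the published Chou-Fasman table, and shortening the window at the ends of the sequence is an assumption the slide does not specify.

```python
# Placeholder propensities (helix, beta, coil) for a few residue types.
# Illustrative values only; substitute the real Chou-Fasman parameters.
PROPENSITY = {
    "A": (1.4, 0.8, 0.8),
    "E": (1.5, 0.4, 0.9),
    "L": (1.2, 1.3, 0.6),
    "V": (1.0, 1.7, 0.5),
    "I": (1.1, 1.6, 0.5),
    "G": (0.6, 0.8, 1.6),
    "P": (0.6, 0.6, 1.5),
    "N": (0.7, 0.9, 1.5),
}
STATES = "HBC"  # helix, beta, coil


def predict(sequence, window=7):
    """Assign H, B or C to each residue from window-averaged propensities."""
    half = window // 2
    result = []
    for i in range(len(sequence)):
        # Window centered on residue i, truncated at the sequence ends.
        segment = sequence[max(0, i - half): i + half + 1]
        averages = [
            sum(PROPENSITY[res][k] for res in segment) / len(segment)
            for k in range(3)
        ]
        # Pick the state with the highest averaged value.
        result.append(STATES[averages.index(max(averages))])
    return "".join(result)


print(predict("AEAEAEAEGPGPNGVIVIVIV"))
```

Residues missing from the table raise a KeyError here; a real implementation would cover all 20 amino acids.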

39
Simplified C-F Algorithm
[Plot: helix, beta and coil profiles along the sequence, residues 10 to 60]
40
Limitations of Chou-Fasman

Does not take into account
long range information (> 3 residues away)
structure class
Does not include
related sequences or alignments in prediction
process
Only about 55% accurate

41
The PHD Algorithm

Search the SWISS-PROT database and select high
scoring homologues
Create a sequence profile from the resulting
multiple alignment
Include global sequence info in the profile
Input the profile into a trained two-layer neural
network to predict the structure and to
clean-up the prediction
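
To make the "sequence profile" step concrete, the sketch below turns a small multiple alignment into per-column residue frequencies. It is an assumption-laden illustration, not PHD's actual input encoding: the function name, the gap character "-" and the toy alignment are all hypothetical.

```python
from collections import Counter


def sequence_profile(alignment):
    """Per-column residue frequencies for equally long aligned sequences.

    Gap characters ('-') are ignored when computing frequencies.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        # An all-gap column yields an empty frequency table.
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Toy alignment of three sequences.
toy_alignment = ["ACD", "ACE", "A-E"]
for position, frequencies in enumerate(sequence_profile(toy_alignment), start=1):
    print(position, frequencies)
```

A conserved column gives one residue with frequency 1.0, while a variable column spreads the frequency over several residues; this is the kind of information a single sequence cannot provide.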

42
Prediction Performance
43
Best of the Best

PredictProtein-PHD (72%)
http://cubic.bioc.columbia.edu/predictprotein/
Jpred (73-75%)
http://www.compbio.dundee.ac.uk/www-jpred/submit.html
SAM-T02 (75%)
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
PSIpred (77%)
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

45
Evaluating Secondary Structure Predictions

Historically problematic due to tester bias
(developers train and test their own
predictions)
Some predictions were up to 10% off
Move to make testing independent and test sets as
large as possible
EVA: evaluation of protein secondary structure
prediction

46
EVA

> 10 different methods evaluated as new structures
are deposited in the PDB
Results posted on the web and updated weekly
http://cubic.bioc.columbia.edu/eva

47
EVA
48
Secondary Structure Evaluation

Q3 score
standard method in evaluating performance, 3
states (H,C,B) evaluated like a multiple choice
exam with 3 choices. Same as % correct
SOV (segment overlap score)
more useful measure of how segments overlap and
how much overlap exists
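
The Q3 score is simple enough to state in code: it is the percentage of residues whose predicted state matches the observed state. The sketch below uses a hypothetical function name and made-up state strings. SOV is not shown because its segment-based definition is more involved.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state (H, B or C) is correct."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up example: 7 of 8 residues agree.
print(q3("HHHHCCBB", "HHHCCCBB"))
```

Q3 counts each residue independently, so a prediction can score well while still breaking helices or strands into unrealistic fragments.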

49
Homology Modeling

Similar sequences usually share the same fold.
Structure models can be constructed from
alignments with proteins having a 3D structure.
When no suitable template structure can be found,
possible templates are found using threading
More with Boris in 3.3 and 3.5

50
ab initio Protein Structure Prediction

Predicting the 3D structure without any prior
knowledge
Used when homology modeling or threading have
failed (no homologues are evident)
Equivalent to solving the Protein Folding
Problem
Still an active research problem
Howard's Lecture 5.2

51
Conclusions

Protein structures are now sufficiently abundant
and well defined that they can be classified
using well-developed rules of taxonomy
Distant relationships and common rules of folding
can be uncovered through fold classification and
comparison

52
Conclusions

Structure prediction is still one of the key
areas of active research in bioinformatics and
computational biology
Significant strides have been made over the past
decade through the use of larger databases,
machine learning methods and faster computers
