Title: Lecture 9.2: Homology and Structural Similarity (What to do when you have no structure ...)
1Lecture 9.2: Homology and Structural Similarity (What to do when you have no structure ...)
- Boris Steipe
- boris.steipe_at_utoronto.ca http://biochemistry.utoronto.ca/steipe
- Departments of Biochemistry and Molecular and Medical Genetics
- Program in Proteomics and Bioinformatics
- University of Toronto
- (This lecture is based in part on a lecture held
by Chris Hogue, Toronto, for CBW in 2002)
2Concepts
- Domains are folding units, functional units and units of inheritance.
- Homologous domains have similar structure.
- Structural similarity can be measured and similar domains can be retrieved from databases.
- Detection of similar folds can provide mechanistic explanations.
- Threading methods can sometimes find similar folds.
- Ab initio predictions of structure are highly experimental.
3Concept 1
- Domains are folding units, functional units, and units of inheritance.
4Domains as units of inheritance - the PH domain
story
Dotlet - A dotplot of Pleckstrin (p47) against itself reveals
similarity between the N- and C-terminus !
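A minimal sketch of the dotplot idea, assuming a simple identity count in a sliding window (Dotlet itself scores windows with a substitution matrix; the window size, threshold and toy sequence below are hypothetical). An internal repeat shows up as a run of hits parallel to the main diagonal.

```python
def dotplot(seq_a, seq_b, window=5, min_matches=4):
    """Return the set of (i, j) window start positions whose windows
    share at least min_matches identical residues."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window)
                          if seq_a[i + k] == seq_b[j + k])
            if matches >= min_matches:
                hits.add((i, j))
    return hits


def off_diagonal_hits(seq, window=5, min_matches=4):
    """Self-comparison: keep only hits above the main diagonal (i < j)."""
    return sorted((i, j) for i, j in dotplot(seq, seq, window, min_matches)
                  if i < j)


# Hypothetical toy sequence with the segment MKWLVEGH repeated at offset 18.
toy = "MKWLVEGH" + "AAAAA" + "PPPPP" + "MKWLVEGH"
repeat_hits = off_diagonal_hits(toy)
```

Hits such as (0, 18) and (1, 19) lie on one off-diagonal line, which is the signature of a duplicated segment.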
5Domains as units of inheritance - the PH domain
story
Matrix: EBLOSUM62
Gap_penalty: 10.0
Extend_penalty: 0.5
Length: 100
Identity: 31/100 (31.0%)
Similarity: 48/100 (48.0%)
Gaps: 6/100 ( 6.0%)

  6 IREGYLVKKGSVFNTWKPMWVVLLEDG--IEFYKKKSDNSPKGMIPLKGS  53
245 IKQGCLLKQGHRRKNWKVRKFILREDPAYLHYYDPAGAEDPLGAIHLRGC 294

 54 TLTSPCQDFGKRMF----VFKITTTKQQDHFFQAAFLEERDAWVRDINKA  99
295 VVTSVESNSNGRKSEEENLFEIITADEVHYFLQAATPKERTEWIKAIQMA 344

EMBOSS - Optimal sequence alignment: 31% identity
over 100 amino acids.
6Domains as units of inheritance - the PH domain
story
[Figure: two schematic bars of Human p47, each running from N- to -C]
Overlapping alignments may define domain
boundaries ! We can search a database with this
knowledge ...
7Domains as units of inheritance - the PH domain
story
[Figure: schematic bar of Human p47, N- to -C]
Hits are smoothly bounded and extend over the
entire domain.
486 hits ... etc.
8Domains as units of inheritance - the PH domain
story
in contrast ...
[Figure: schematic bar of Human p47, N- to -C]
Hits extend over the entire domain. PSI Blast
would be difficult ...
(Yeast only, for clarity)
9Concept 2
- Homologous domains have similar structure.
10Homologous domains have similar structures
1PLS/2DYN: 23% ID
1PLS - PH domain (Human pleckstrin)
2DYN - PH domain (Human dynamin)
11Homology and Structural Similarity
Proteins that diverge in evolution maintain their
global fold !
Russell et al. (1997) J Mol Biol 269: 423-439
12Concept 3
- Structural similarity can be measured and similar
domains can be retrieved from databases.
13RMSD metric
To calculate the RMSD, a pairwise correspondence
of points has to be defined first.
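A minimal sketch of the coordinate RMSD, assuming the pairwise correspondence is given simply by list order (point i of one set is paired with point i of the other). The function name and toy coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of
    (x, y, z) points; point i of coords_a corresponds to point i of coords_b."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty point lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
same_order = rmsd(a, b)            # every pair is 1 unit apart
swapped = rmsd(a, list(reversed(b)))  # a different correspondence
```

The same two point sets give a different value when the correspondence changes, which is why the pairing has to be fixed before the number means anything.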
14RMSDopt
$\mathrm{RMSD}_{opt} = \min(\mathrm{RMSD}_{coord})$
$\mathrm{RMSD}_{opt} = \mathrm{RMSD}_{coord}(A, R_s \times (B - T_s))$
The translation vector $T_s$ and the rotation matrix
$R_s$ define a superposition of the vector set B on
A.
An analytic solution of the superposition problem
is available, but not straightforward (involves
an eigenvalue problem).
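One way to make the superposition step concrete is sketched below. It swaps in the SVD-based Kabsch procedure, which is closely related to the eigenvalue formulation mentioned above but is not necessarily the method the lecture had in mind. NumPy is assumed to be available; the function name `superpose` and the return convention (rotation R, translation T, applied as R x (B - T)) are choices made for this example.

```python
import numpy as np


def superpose(A, B):
    """Optimal rigid-body superposition of point set B onto A (rows are
    corresponding points). Returns (R, T, rmsd) such that the rows of
    (B - T) @ R.T are the superposed coordinates, i.e. R x (B - T)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    centroid_a, centroid_b = A.mean(axis=0), B.mean(axis=0)
    # 3x3 covariance between the centered point sets.
    H = (B - centroid_b).T @ (A - centroid_a)
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Translation chosen so that R @ (b - T) == R @ (b - centroid_b) + centroid_a.
    T = centroid_b - R.T @ centroid_a
    fitted = (B - T) @ R.T
    rmsd = float(np.sqrt(((A - fitted) ** 2).sum(axis=1).mean()))
    return R, T, rmsd
```

Applied to a copy of a structure that has only been rotated and shifted, the returned RMSD should be zero up to rounding error.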
15Superposition in practice
- Prealigned structures
- VAST (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml)
- FSSP (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
- Homstrad (http://www-cryst.bioc.cam.ac.uk/homstrad/)
                      60        70        80        90       100
1dro  (  32 ) wdkVyMaAkAG-------rIsFykd-qkgyk----------snpelTfrg
1btn  (  23 ) whnVyCvin-------nqeMgFykd-aksaa----------sg--ipYhs
1pls  (  21 ) wkpmwVVLle-------dgIeFykk-ksdn---------------spk--
1fgya ( 281 ) wkrrwFiLTd-------ncLyYFey-ttdk---------------epr--
1faoa ( 181 ) wktrwFtLhr-------neLkYfkd-qmsp---------------epi--
1qqga (  25 ) mhkrFFVLraaseaggparLEyYen-ekkwr----------hkssapk--
1bak  ( 576 ) wqrryFyLfp-------nrlewrge----------------geap-----
1dyna (  30 ) skeYwFvLta-------enLsWykd-deek---------------ekk--
1dbha ( 456 ) kherhIFLFd--------gLICCksnhgqprl--------pgasnaeyrL
1b55a (  25 ) fkkrlFlLtv-------hkLsYyeydfe--r----------grrgskk--
1mai  (  37 ) rreRfYkLqe-----dcktIwqesr-kv-----------------mrspe
1fhoa (  25 ) pKlRyVfLfr-------nkimFtEqd---ast--------s---ppsyth
1foea (1288 ) ePeLaAfVFk-------tAVVLVykdgskqkkklvgshrlsiyeewdpfr
bbbbbb bbbbb
16Superposition in practice
- Web services
- VAST (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml)
- CE (http://cl.sdsc.edu/ce.html)
- LGA (http://predictioncenter.llnl.gov/local/lga/lga.html)
- Prosup (http://lore.came.sbg.ac.at:8080/CAME/CAME_EXTERN/PROSUP/)
(Note: Click on "Rasmol" on the results page to
return the alignment)
Usability and reliability of these services are
variable. "Intelligent" algorithms can
superimpose without the need for user definition
of correspondence. The downside is that the user
cannot define correspondences.
17Superposition in practice - locally installed
- Many molecular modeling programs have superposition features
- DeepView (http://ca.expasy.org/spdbv/)
- MolMol (http://www.mol.biol.ethz.ch/wuthrich/software/molmol/)
- O (http://alpha2.bmc.uu.se/alwyn/o_related.html)
- WhatIf (http://www.cmbi.kun.nl/whatif/)
18When is RMSD misleading ?
- Rigid body movement of domains or subdomains ... ?
19Internal coordinates as an alternative to
superposition
[Figure: points a, b, c in one structure and a', b', c' in the other; corresponding pairs (a,a'), (b,b'), (c,c')]
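A minimal sketch of a superposition-free comparison, assuming the internal coordinates are the intramolecular distances between all point pairs (one common choice; the slide does not say which internal coordinates it shows). The name `distance_rmsd` and the toy coordinates are illustrative.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Compare two structures through their internal distances: the RMS
    difference between d(i, j) in A and d(i, j) in B over all pairs i < j.
    No rotation or translation is needed because distances are invariant."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two point lists of equal length >= 2")
    pairs = list(combinations(range(len(coords_a)), 2))
    total = 0.0
    for i, j in pairs:
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        total += (d_a - d_b) ** 2
    return math.sqrt(total / len(pairs))


original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
# The same three points rotated by 90 degrees about z and shifted.
moved = [(5.0, -2.0, 1.0), (5.0, -1.0, 1.0), (3.0, -2.0, 1.0)]
unchanged = distance_rmsd(original, moved)
```

One known limitation of this choice: a structure and its mirror image have identical distance matrices, so the measure cannot tell them apart.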
20VAST - Database searches at MMDB
21DALI ...
22... and FSSP
The prealigned fold-tree
23Workflow MMDB ...
Open http://www.ncbi.nlm.nih.gov/ and enter your
search term ...
24Workflow MMDB ...
Choose "Structure" ...
25Workflow MMDB ...
Choose your protein of interest ...
26... structure summary ...
27... access domains similar to SH3 ...
28... select, download ...
29... display.
30Concept 4
- Detection of similar folds can provide
mechanistic explanations.
31Protein Modules
Modular interactions between biomolecules are
responsible for the inner workings of the
cell. There are far more modular interacting
proteins than classical enzymes in the human
genome; we have known this since S. cerevisiae.
Pawson & Lin
32Protein Domains: an alphabet of functional
modules
33Workflow for domain architectures
Starting from a citation ...
34... access sequence ...
35... display sequence ...
36... link to domain architecture ...
(from CDD database - incl. SMART and Pfam)
37... show domain relatives ...
38... access domain information ...
39... in CDD ...
40... visualize in Cn3D.
41Protein structure prediction
- What to do when no structure is known and no
homologues are found ?
42Three Paths to Protein Structure Prediction
- Homology Modeling
- Threading (Fold recognition)
- Ab initio prediction
43Concept 5
- Threading methods can sometimes find similar
folds.
44Fold recognition ("Threading")
[Figure labels: Template Structure; Query Sequence (repeated four times)]
45Threading Database Search
- Premise is that most sequences match some 3-D structure that is already known (1/2)
- Given a database of known 3-D protein folds
- align the test sequence to each known protein
- in real 3-D coordinate space (slow but exact)
- in parameterized 1-D space (fast but approximate)
- optimize some scoring function
- sort out best sequence-structure alignment
- assess alignments - statistically significant?
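The search loop above can be sketched as a toy "parameterized 1-D" threading, under strong assumptions: each fold is reduced to a string of environment labels (b = buried, e = exposed), alignments are ungapped, and the scoring function simply rewards hydrophobic residues in buried positions and polar residues in exposed ones. The fold library, residue classes and scores are all hypothetical and far simpler than any real threading potential.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed residue classes, illustration only


def position_score(residue, environment):
    """+1 if residue type suits the environment, -1 otherwise."""
    buried = environment == "b"
    return 1 if (residue in HYDROPHOBIC) == buried else -1


def thread(query, template_env):
    """Slide the query along one template (no gaps) and return the best
    (score, offset), or None if the query is longer than the template."""
    best = None
    for offset in range(len(template_env) - len(query) + 1):
        score = sum(position_score(res, template_env[offset + k])
                    for k, res in enumerate(query))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def search(query, fold_library):
    """Thread the query onto every fold and rank by score, best first."""
    results = []
    for name, env in fold_library.items():
        hit = thread(query, env)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    return sorted(results, reverse=True)


library = {"foldA": "eebbeebbee", "foldB": "bbbbbeeeee"}  # hypothetical folds
ranking = search("LLKKLL", library)
```

The ranked list still has to be assessed statistically; a top score on its own says nothing about significance.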
46Threading Statistics
- Z score (sequence composition correction)
- number of standard deviations the found alignment is off from the mode of a randomized version of the structure or profile
- P value (sequence length correction)
- Shuffle the sequence - make a distribution of random threads
- Is the unscrambled thread any better than a randomly optimized sequence
- Z score of Z scores
- Look for P values as a criterion for choosing a threading method...
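A minimal sketch of the shuffling idea, assuming the background distribution is summarized by its mean and standard deviation (the slide speaks of the mode; the mean is used here as the conventional simplification). The scoring function `env_match` and its template string are hypothetical stand-ins for a real threading score.

```python
import random
import statistics


def shuffled_z_score(sequence, score_fn, n_shuffles=200, seed=0):
    """Z score of the real sequence's score against scores of shuffled
    sequences with the same composition. Returns 0.0 if the shuffled
    scores do not vary at all."""
    rng = random.Random(seed)
    observed = score_fn(sequence)
    residues = list(sequence)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(score_fn("".join(residues)))
    mean = statistics.fmean(background)
    sd = statistics.pstdev(background)
    if sd == 0:
        return 0.0
    return (observed - mean) / sd


def env_match(seq, template="bbbbeeeebbbbeeee"):
    """Hypothetical score: count positions where L sits in a buried (b)
    position or a non-L residue sits in an exposed (e) position."""
    return sum(1 for res, env in zip(seq, template)
               if (res == "L") == (env == "b"))


z = shuffled_z_score("LLLLKKKKLLLLKKKK", env_match)
```

Because shuffling preserves composition, a score that depends only on composition gets a Z score of zero, which is the sense in which the Z score corrects for sequence composition.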
47Database Searching...
- Sensitivity
- High sensitivity implies finding all possible true positive matches in the database
- Specificity
- High specificity implies finding no false positive matches in the search.
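A minimal sketch of how the two terms can be put into numbers for a single search, assuming the true family members are known. Specificity is expressed here through the fraction of reported hits that are false positives, which is one reading of the definition above; the identifiers and counts are made up for illustration.

```python
def search_quality(hits, true_members):
    """Return (sensitivity, false_positive_fraction) for one search.
    sensitivity = true positives found / all true members in the database
    false_positive_fraction = false positives / all reported hits"""
    hits, true_members = set(hits), set(true_members)
    true_pos = len(hits & true_members)
    false_pos = len(hits - true_members)
    false_neg = len(true_members - hits)
    sensitivity = (true_pos / (true_pos + false_neg)) if true_members else 0.0
    false_positive_fraction = (false_pos / len(hits)) if hits else 0.0
    return sensitivity, false_positive_fraction


# Hypothetical search: 10 hits, only one of them among 8 true members.
hits = ["t1"] + ["x%d" % i for i in range(9)]
true_members = ["t%d" % i for i in range(1, 9)]
sensitivity, fp_fraction = search_quality(hits, true_members)
```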
48Threading as a Database Search Method
- Has INCREDIBLY poor sensitivity
- 10-20% on a good day
- Has INCREDIBLY poor specificity.
- 90% of hits are false positives
- So...
49Interpret Threading Accordingly...
- In a ranked list of 10 matches, expect that only one might be correct
- Expect that none may be correct
- Expect that the top ranked hit is a false positive...
50How then does Threading find things?
- If there is a true positive in a threading search hit list
- People find it ...
- It is most often found by FUNCTIONAL similarity.
- Similar enzymatic mechanisms
- Motifs, DART ...
- Similar roles, cellular distributions ...
51Concept 6
- Ab initio predictions of structure are highly
experimental.
52Protein structure prediction is easy
The assumption: Native structure is a global
energy minimum
- The algorithm
- Reasonably generate all conformations
- Score with an appropriate scoring function
- Choose the one with best score
reasonable: search finishes in reasonable time
appropriate: monotonic with q (or at least ΔG), useful radius of convergence
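The three-step algorithm can be sketched on a toy problem where exhaustive generation is actually possible. The example assumes the 2-D HP lattice model (residues are H or P, conformations are self-avoiding walks on a square lattice, each non-bonded H-H contact scores -1); this model is a stand-in chosen for illustration and is not part of the lecture.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def conformations(n):
    """Generate every self-avoiding walk of n residues on a square lattice."""
    def extend(path):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in path:
                path.append(nxt)
                yield from extend(path)
                path.pop()
    yield from extend([(0, 0)])


def energy(sequence, path):
    """Score: -1 for each pair of H residues that are lattice neighbours
    but not neighbours along the chain."""
    e = 0
    for i in range(len(path)):
        for j in range(i + 2, len(path)):
            if sequence[i] == "H" and sequence[j] == "H":
                dist = abs(path[i][0] - path[j][0]) + abs(path[i][1] - path[j][1])
                if dist == 1:
                    e -= 1
    return e


def predict(sequence):
    """Generate all conformations, score each, return (best_energy, path)."""
    scored = ((energy(sequence, p), p) for p in conformations(len(sequence)))
    return min(scored, key=lambda item: item[0])
```

For a four-residue chain there are only 36 walks, so `predict("HPPH")` finishes instantly; the number of walks grows so quickly with chain length that the same loop is hopeless for real proteins.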
53Why is structure prediction hard ?
- Appropriate scoring functions
- Reasonable structure generation
- Working approaches
54Protein structure scoring functions
Molecular Mechanics / Empirical (Statistical) / Combinations
The scoring function is the single most important
component of any optimization !
55Protein structure scoring functions
Molecular Mechanics / Empirical (Statistical) / Combinations
Molecular mechanics terms: bonds, angles, dihedrals, Van der Waals, Coulomb
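The five molecular mechanics terms can be sketched with their common textbook functional forms. The slide's own equations are not preserved, so the forms below (harmonic bonds and angles, cosine dihedrals, Lennard-Jones, Coulomb) are an assumption, and all parameter values passed in are up to the caller. Units assumed: kcal/mol, Angstrom, radians, elementary charges.

```python
import math

COULOMB_CONSTANT = 332.06  # kcal * Angstrom / (mol * e^2)


def bond_energy(r, r0, k):
    """Harmonic bond stretching around the equilibrium length r0."""
    return 0.5 * k * (r - r0) ** 2


def angle_energy(theta, theta0, k):
    """Harmonic angle bending around the equilibrium angle theta0."""
    return 0.5 * k * (theta - theta0) ** 2


def dihedral_energy(phi, barrier, periodicity, phase=0.0):
    """Periodic torsion term."""
    return 0.5 * barrier * (1.0 + math.cos(periodicity * phi - phase))


def lennard_jones(r, epsilon, sigma):
    """Van der Waals term: 12-6 Lennard-Jones potential."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Electrostatic interaction between two point charges."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)
```

A total energy is then the sum of these terms over all bonds, angles, dihedrals and non-bonded atom pairs of a conformation.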
56Protein structure scoring functions
Molecular Mechanics / Empirical (Statistical) / Combinations
[Equation labels: Frequency; Energy of state i; Partition function]
[Equation labels: Potential energy between a,b at separation x; Frequency of observation of a,b at separation x; All observations of a,b]
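The labels on this slide fit the Boltzmann relation (frequency of a state = exp(-E_i/kT) divided by the partition function) and its inversion, E_ab(x) = -kT ln(N_ab(x) / N_ab). The formulas themselves are not preserved in the slide text, so this reading is an assumption. The sketch below derives a statistical potential from a table of hypothetical counts, with kT taken as roughly 0.592 kcal/mol near 298 K.

```python
import math

KT = 0.592  # kcal/mol at about 298 K


def statistical_potential(counts_by_separation, kT=KT):
    """Inverse Boltzmann: turn observation counts for a residue pair (a, b)
    at each separation bin x into an energy -kT * ln(N_ab(x) / N_ab).
    Bins with zero observations are left out (their energy is undefined)."""
    total = sum(counts_by_separation.values())
    potential = {}
    for x, n in counts_by_separation.items():
        if n > 0:
            potential[x] = -kT * math.log(n / total)
    return potential


# Hypothetical counts of an a,b pair at 4, 6 and 8 Angstrom separation.
counts = {4.0: 10, 6.0: 40, 8.0: 0}
potential = statistical_potential(counts)
```

Frequently observed separations come out with lower energies; real knowledge-based potentials additionally normalize against a reference state, which this sketch omits.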
57Protein structure scoring functions
Molecular Mechanics / Empirical (Statistical) / Combinations
Usually combine potential energy and empirical
solvation terms
58Why is structure prediction hard ?
- Appropriate scoring functions
- Reasonable structure generation
- Working approaches
59Combinatorially large search spaces make
enumeration impossible.
- Consider
- 100 residues
- 3 states per residue
- $3^{100} \approx 10^{47}$ conformations
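The arithmetic behind this estimate can be checked directly. The enumeration rate used below (10^12 conformations per second) is a hypothetical figure chosen only to show the scale.

```python
import math


def conformation_count(residues, states_per_residue):
    """Number of conformations if every residue independently adopts one
    of a fixed number of states."""
    return states_per_residue ** residues


def enumeration_years(count, rate_per_second):
    """Years needed to visit every conformation at a given rate."""
    seconds = count / rate_per_second
    return seconds / (365.25 * 24 * 3600)


n = conformation_count(100, 3)
order_of_magnitude = math.log10(n)   # about 47.7
years = enumeration_years(n, 1e12)   # hypothetical rate
```

Even at that rate the enumeration would take on the order of 10^28 years.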
60A Blind Golfer's view of global optimization I
How do you hit a hole-in-one, when you can't even
see the hole ? How do you hit 18 holes-in-one in
a row ?
61A Blind Golfer's view of global optimization II
Change the shape of the golf course !
62An analysis of why the Blind Golfer's strategy
works
[Figure: panels (a) and (b)]
Local improvements in position (a) lead to
incremental improvements in energy (b) !!!
63How does nature fold proteins ?
The funnel model reconciles the thermodynamic and
the kinetic view !
[Figure: folding landscapes; axes q and ΔG]
In a flat folding landscape, a thermodynamic
minimum is kinetically inaccessible.
An ideal funnel results in fast, two-state
folding through many possible pathways.
But ...
Dill KA & Chan HS (1997) From Levinthal to
pathways to funnels. Nature Struct Biol 4: 10-19
64How does nature fold proteins ?
Real folding landscapes appear to be more complex
- robust folding is possible, but so are
populated intermediate states and kinetic traps.
What does this mean for promising computational
strategies ?
- To the degree that folding is under thermodynamic control: direct inference of structure is possible.
- To the degree that folding is under kinetic control: simulation of the folding pathway is required.
Dill KA & Chan HS (1997) From Levinthal to
pathways to funnels. Nature Struct Biol 4: 10-19
65How to solve hard problems
- Simplification
- Brute force
- Branch and bound
- Heuristics
- Local optimization
- Simulated annealing
- Genetic algorithms
- Neural networks
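One of the listed strategies, simulated annealing, can be sketched on a one-dimensional toy landscape. The energy function below (a broad funnel with superimposed ripples), the cooling schedule and all parameter values are hypothetical; the point is only that accepting occasional uphill moves at high temperature lets the search escape local minima.

```python
import math
import random


def rugged_funnel(x):
    """Hypothetical landscape: overall funnel 0.1*x^2 plus local ripples."""
    return 0.1 * x * x + math.sin(3.0 * x)


def simulated_annealing(energy, x0, steps=5000, t_start=2.0, t_end=0.01,
                        step_size=0.5, seed=0):
    """Metropolis search with an exponentially decreasing temperature.
    Returns the best (x, energy) seen during the run."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for k in range(steps):
        temperature = t_start * (t_end / t_start) ** (k / (steps - 1))
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability exp(-dE / T).
        if e_candidate <= e or rng.random() < math.exp(-(e_candidate - e) / temperature):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = simulated_annealing(rugged_funnel, x0=8.0)
```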
66Is structure prediction NP hard ?
Not necessarily: nature does it in P.
A problem that is NP-hard in principle can be P
in practice. This is the significance of the
protein folding funnel. Search for local
solutions - subproblems !
67Why is structure prediction hard ?
- Appropriate scoring functions
- Reasonable structure generation
- Working approaches
68Ab initio prediction
Isites: Sequence - structure motifs
HMMSTR: Hidden Markov Model 2° structure
prediction
Rosetta: Monte Carlo fragment-move-based
structure generation, Bayesian conditional
probability scoring function
Bystroff, C. & Shao, Y. (2002) Fully automated
protein structure prediction using ISITES, HMMSTR
and ROSETTA. Bioinformatics 18 S1: S54-S61
69Ab initio prediction
What can you expect ?
- 50 residues < 6 Å RMSD
- 20% of proteins globally topologically correct
- 60% of proteins with partially topologically correct substructures
[Figure: three example structures, each labeled RMSD 5.9 Å]
Bystroff, C. & Shao, Y. (2002) Fully automated
protein structure prediction using ISITES, HMMSTR
and ROSETTA. Bioinformatics 18 S1: S54-S61
70An ab initio prediction server on the WWW
http://robetta.bakerlab.org
71Open Issues
- Scoring functions
- radius of convergence ...
- Workflow
- what will you do with the results ?