Title: Efficient Nearest Neighbor Searching for Motion Planning
Anna Yershova, Department of Computer Science, Duke University
February 5, 2010, NC State University
Automated Protein Structure Determination using RDCs

Introduction
Motivation
Protein Structure Determination is Important
Amino acid sequences
Structures
Functions
Protein redesign
- High-resolution structures are needed for:
  - Determining protein functions
  - Protein redesign

What is Protein Structure: Primary Structure
The sequence of amino acids forms the backbone. Each residue carries a side chain attached to the backbone.
[Figure labels: dihedral angle, side chain, amino acid]

What is Protein Structure: Secondary Structure Elements
Local folding is maintained by short-distance interactions.

What is Protein Structure: 3D Fold
Global 3D folding is maintained by more distant interactions.
[Figure labels: alpha-helix, side chain, loop, beta-strands]

High-Throughput Structure Determination Is Important
The gap between sequences and structures
http://www.metabolomics.ca/News/lectures/CPI2008-short.pdf

Current Approaches for Structure Determination
- X-ray crystallography
  - Difficulty: growing good-quality crystals
- Nuclear Magnetic Resonance (NMR) spectroscopy
  - Difficulty: lengthy (expensive) processing and analysis of experimental data
Both require expressing and purifying proteins.

Bruce Donald's Lab
Michael Zeng, Chittu Tripathy
Lincong Wang
Pei Zhou
Bruce Donald
Cheng-Yu Chen, John MacMaster

Types of NMR Spectroscopy Data
[Figure labels: R, Ha, NOE, B0; values 4.2, 8.9, 133.1, 172.1]
- Chemical shift (CS)
  - Unique resonance frequency, serves as an ID
- Nuclear Overhauser effect (NOE)
  - Local distance restraint between two protons
- Residual dipolar coupling (RDC)
  - Global orientational restraint for bond vectors

Resonance Assignment Problem
Assigning chemical shifts to each atom
Bailey-Kellogg et al., 2000, 2004
http://www.pnas.org/content/102/52/18890/suppl/DC1

NOE Assignment Problem
Obtain local distance restraints between protons
A famous bottleneck
Bailey-Kellogg et al., 2000, 2004

Structure Determination from NOEs
NOESY spectrum
Resonance assignments
NOE assignment
Assignment ambiguity
Distance geometry: NP-hard (Saxe '79; Hendrickson '92, '95)

Protein Structure Determination is Hard
Traditional Structure Determination Protocol
A famous bottleneck

Protein Structure Determination is Hard
Traditional Structure Determination Protocol
- Error propagation
- Local minima
- Manual intervention for the initial fold and for evaluation of NOE assignments
A famous bottleneck
Can we have a poly-time algorithm using orientational restraints?
Yes: Wang and Donald, 2004; Wang et al., 2006

Background
RDCs
RDC Equation for a Single Bond
[Figure labels: alignment medium, B0, v, a, b]
S: the Saupe matrix. S is traceless and symmetric, so it contains 5 degrees of freedom.
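As a minimal sketch of the single-bond relation, assume the commonly used form r = D_max · vᵀ S v, where v is the unit bond vector and S is the Saupe matrix. The function names, the parameter ordering (Sxx, Syy, Sxy, Sxz, Syz), and all numerical values below are illustrative assumptions, not taken from the slides. The sketch shows why a symmetric, traceless 3x3 matrix has only 5 free parameters.

```python
import math

def saupe_from_params(sxx, syy, sxy, sxz, syz):
    """Build a symmetric, traceless 3x3 matrix from 5 independent parameters."""
    szz = -(sxx + syy)  # tracelessness fixes the last diagonal entry
    return [
        [sxx, sxy, sxz],
        [sxy, syy, syz],
        [sxz, syz, szz],
    ]

def rdc(d_max, v, s):
    """Evaluate r = d_max * v^T S v after normalising the bond vector v."""
    norm = math.sqrt(sum(c * c for c in v))
    u = [c / norm for c in v]  # only the direction of the bond matters
    return d_max * sum(u[i] * s[i][j] * u[j] for i in range(3) for j in range(3))

# Hypothetical numbers, chosen only for illustration.
S = saupe_from_params(1.0e-4, 2.0e-4, 0.5e-4, -0.3e-4, 0.1e-4)
r_z = rdc(1.0, [0.0, 0.0, 1.0], S)  # a bond along z picks out the Szz entry
```

With d_max set to 1, a bond vector along the z axis returns exactly the Szz entry, which is a quick sanity check on the quadratic form.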

Protein Structure Determination is Hard
Traditional Structure Determination vs. RDC-PANDA
Traditional protocol:
- Error propagation
- Local minima
- Manual intervention for the initial fold and for evaluation of NOE assignments
RDC-PANDA protocol:
[Flowchart labels: Constant number of NOEs, RDCs, RDC-ANALYTIC PACKER, Global Fold, Sidechain Placement, NOE Assignments, XPLOR-NIH, NOE Assignments, 3D Structures]
Zeng et al. (Jour. Biomolecular NMR, 2009)

Importance of Backbone Structure Determination
Global orientational restraints from RDCs
Sparse data (high-throughput, large proteins, membrane proteins)
Compute initial fold using exact solutions to RDC equations
Avoid the NP-hard problem of structure determination from NOEs
Resolve NOE assignment ambiguity
Automated side-chain resonance assignment

Current Limitations of RDC-PANDA
- Because it requires only 2 RDCs per residue:
  - Only secondary structure elements (SSEs) can be reliably determined; NOEs are needed to determine the structure of loops
  - Difficulty in handling missing data

My Current Project
- Improve current protein structure determination techniques from our lab
- Design new algorithms for protein backbone structure determination using orientational restraints from RDCs

Literature Overview
- Distance geometry based structure determination
  - Braun, 1987
  - Crippen and Havel, 1988
  - Moré and Wu, 1999
- Heuristic based structure determination
  - Brünger, 1992
  - Nilges et al., 1997
  - Güntert, 2003
  - Rieping et al., 2005
- RDC-based structure determination
  - Tolman et al., 1995
  - Tjandra and Bax, 1997
  - Hus et al., 2001
  - Tian et al., 2001
  - Prestegard et al., 2004
  - Wang and Donald (CSB 2004)
  - Wang and Donald (Jour. Biomolecular NMR, 2004)
  - Wang, Mettu and Donald (JCB 2005)
  - Donald and Martin (Progress in NMR Spectroscopy, 2009)
- Heuristic based automated NOE assignment
  - Mumenthaler et al., 1997
  - Nilges et al., 1997, 2003
  - Herrmann et al., 2002
  - Schwieters et al., 2003
  - Kuszewski et al., 2004
  - Huang et al., 2006
- Automated NOE assignment starting with initial fold computed from RDCs
  - Wang and Donald (CSB 2005)
  - Zeng et al. (CSB 2008)
  - Zeng et al. (Jour. Biomolecular NMR, 2009)
- Automated side-chain resonance assignment
  - Li and Sanctuary, 1996, 1997
  - Marin et al., 2004
  - Masse et al., 2006
  - Zeng et al. (In submission, 2009)

RDC Equation for a Single Bond
Linear in S: a fixed v defines a hyperplane.
Quadratic in v: a fixed S defines a hyperboloid.

RDC Equation for a Single Bond
One RDC equation defines a collection of hyperplanes in 7 variables (5 for S, 2 for the unit vector v).
Linear in S: a fixed v defines a hyperplane.
Quadratic in v: a fixed S defines a hyperboloid.
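The "linear in S, quadratic in v" observation can be checked numerically. The sketch below assumes the quadratic form vᵀ S v with Szz eliminated through tracelessness; the helper names `rdc_row` and `quad_form`, the parameter ordering, and the numbers are hypothetical and serve only as an illustration.

```python
def rdc_row(v):
    """Coefficients of (Sxx, Syy, Sxy, Sxz, Syz) in v^T S v, using Szz = -(Sxx + Syy)."""
    x, y, z = v
    return [x * x - z * z, y * y - z * z, 2 * x * y, 2 * x * z, 2 * y * z]

def quad_form(v, params):
    """Direct evaluation of v^T S v from the same five parameters."""
    sxx, syy, sxy, sxz, syz = params
    x, y, z = v
    szz = -(sxx + syy)
    return (sxx * x * x + syy * y * y + szz * z * z
            + 2 * (sxy * x * y + sxz * x * z + syz * y * z))

v = [0.6, 0.0, 0.8]                  # hypothetical unit bond vector
p = [1.0, -2.0, 0.5, 0.25, -0.75]    # hypothetical Saupe parameters
q = [0.3, 0.1, -0.4, 0.2, 0.0]       # a second hypothetical parameter set

# For fixed v, the value is a plain dot product with the parameters: a hyperplane.
lin = sum(a * s for a, s in zip(rdc_row(v), p))
```

For a fixed v, the row returned by `rdc_row` is the normal of the hyperplane in the five Saupe parameters. Scaling v by 2 scales `quad_form` by 4, which reflects the quadratic dependence on v.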

RDC Equations for a Protein Portion

RDC Equations for a Protein Portion
[Figure labels: 1, 2, 3, 4, u1, v1, v2]
Too few equations, too many variables!
[1] L. Wang and B. R. Donald. J. Biomol. NMR, 29(3):223-242, 2004.
[2] J. Zeng, J. Boyles, C. Tripathy, L. Wang, A. Yan, P. Zhou, and B. R. Donald. J. Biomol. NMR, Epub ahead of print, PMID: 19711185, 2009.

Forward Kinematics Reduces the Number of Variables
Fix coordinate system.
[Figure labels: v1, v2, u1]

RDC Equations for a Protein Portion
[Figure labels: v1, v2, u1]

RDC Equations for a Protein Portion
Recursive representation is possible!

One Equation Per Dihedral Angle is Not Enough!
- Each equation is linear in S, and quartic in either tan(φ) or tan(ψ)
- To be able to solve this system there must be additional information
- Possible scenarios:
  - Additional RDC measurement(s) for each dihedral angle
  - Additional alignment media
  - Additional NOE data
  - Modeling (Ramachandran regions, steric clashes, energy function)
  - Sampling (for alignment tensors)

RDC-PANDA
The RDC-PANDA Structure Determination Package
- Current requirements
  - 2 RDCs per residue to obtain SSE structures
  - Sparse NOEs to pack the SSEs
- Current bottlenecks
  - Missing data (even in long SSEs)
  - Long loops
  - Sampling for computing alignment tensor(s)
  - Sampling for the orientation of the first peptide plane (pp)
[1] L. Wang and B. R. Donald. J. Biomol. NMR, 29(3):223-242, 2004.
[2] J. Zeng, J. Boyles, C. Tripathy, L. Wang, A. Yan, P. Zhou, and B. R. Donald. J. Biomol. NMR, Epub ahead of print, PMID: 19711185, 2009.

When the Saupe Matrix is Known, the Solution Can Be Found Exactly!
Ellipse equations for the CH bond vector
Wang and Donald, 2004
Donald and Martin, 2009

Solution Structure Deposited Using RDC-PANDA
Solution structure of FF domain 2 of human transcription elongation factor CA150 (FF2), determined using RDC-PANDA
PDB ID: 2KIQ
In collaboration with Dr. Zhou's lab

Current Project
Problem Formulation: NH, CH RDCs in 2 Media
We require measurements for at least 9 consecutive bond vectors (4.5 residues, at two bond vectors per residue) in 2 media. The goal is to handle more equations and errors.

Relationship to Minimization

Relationship to Minimization and SVD
Solving an overconstrained system of linear equations As = b in the least-squares sense is equivalent to projecting the vector b onto the column space of A. This is also equivalent to minimizing the sum of the squared residual terms.
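The equivalence between least squares and projection can be illustrated on a tiny system. The slide title names SVD; the sketch below instead uses the normal equations (AᵀA)x = Aᵀb with Gaussian elimination, chosen only to keep the example dependency-free. The matrix, the right-hand side, and the function names are hypothetical.

```python
def solve_linear(m, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def least_squares(a, b):
    """Minimise ||a x - b||^2 via the normal equations (a^T a) x = a^T b."""
    m, n = len(a), len(a[0])
    ata = [[sum(a[k][i] * a[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
    atb = [sum(a[k][i] * b[k] for k in range(m)) for i in range(n)]
    return solve_linear(ata, atb)

# Hypothetical 4x2 system: more equations than unknowns, no exact solution.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 2.0, 2.0, 4.0]
x = least_squares(A, b)
projection = [sum(A[k][j] * x[j] for j in range(2)) for k in range(4)]
residual = [b[k] - projection[k] for k in range(4)]
# The residual is orthogonal to every column of A: that is the projection property.
orth = [sum(A[k][j] * residual[k] for k in range(4)) for j in range(2)]
```

An SVD-based solver is generally the more numerically robust choice when the matrix is close to rank-deficient; the normal equations are used here only because they fit in a few lines.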

Relationship to Minimization

Relationship to Minimization and SVD
System: A(φi, ψi) s = b, where the matrix A depends on the dihedral angles (φi, ψi).
Solving such a system of nonlinear equations is not trivial! There are multiple local minima in the corresponding minimization problem.

Advantages
- If the minimization problem is solved, then:
  - Computation of packed SSEs and loops is possible without additional NOE data.
  - Saupe matrices for each alignment medium can be computed without sampling.
  - Robust handling of missing values

The Algorithm: Initialization Using Helix
Initialize (φi, ψi) for a helix
Compute initial approximation for Si using SVD
Compute (φi, ψi) using tree search and minimization
Update Si using SVD

The Algorithm: Protein Portion
Initialize Si to computed approximations
Compute (φi, ψi) using tree search and minimization
Update Si using SVD

The Algorithm: Computing Dihedrals
Minimize each of the RMSD terms as a univariate function.
Iteratively minimize the RMSD function.
Compute the list of best solutions.
[Figure labels: φ1, ψ1, ..., φn, ψn, x]
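The idea of minimizing one variable at a time and iterating can be sketched as a coordinate-wise search. This is not the talk's algorithm: the tree search and the list of best solutions are not reproduced, a brute-force grid stands in for the univariate minimization, and `toy_rmsd` is a hypothetical stand-in objective rather than the RDC fit function.

```python
import math

def coordinate_descent(f, x0, lo=-math.pi, hi=math.pi, grid=360, sweeps=20):
    """Minimise f by repeatedly grid-searching one variable with the others fixed."""
    x = list(x0)
    history = [f(x)]
    for _ in range(sweeps):
        for i in range(len(x)):
            best_val, best_t = f(x), x[i]
            for k in range(grid + 1):
                t = lo + (hi - lo) * k / grid
                trial = x[:i] + [t] + x[i + 1:]
                val = f(trial)
                if val < best_val:  # accept only strict improvements
                    best_val, best_t = val, t
            x[i] = best_t
        history.append(f(x))
        if history[-2] - history[-1] < 1e-12:
            break  # no further progress on this grid
    return x, history

# Toy stand-in for an RMSD fit function (hypothetical, not the RDC equations).
target = (-1.0, 2.0)

def toy_rmsd(a):
    d0 = math.cos(a[0]) - math.cos(target[0])
    d1 = math.sin(a[1]) - math.sin(target[1])
    d2 = math.cos(a[0] + a[1]) - math.cos(target[0] + target[1])
    return math.sqrt((d0 * d0 + d1 * d1 + d2 * d2) / 3.0)

angles, history = coordinate_descent(toy_rmsd, [0.0, 0.0])
```

Because each univariate step accepts only improvements, the recorded objective values never increase. The search can still stop in a local minimum, which is why keeping several candidate solutions matters.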

Advantages
- The algorithm converges, since every step minimizes the RMSD function, which is bounded below by zero.
- If the data were perfect, the solutions to the minimization problem would be the roots of the polynomials in the RMSD terms, and the algorithm would find ALL of them.
- The minima of the RMSD terms give a good collection of initial structures for finding local and global minima.
- Robust handling of missing values

Preliminary Results
Preliminary Results: Ubiquitin Helix
The conformation of residues 25-31 of the human ubiquitin helix, computed using NH and CH RDCs in two media (red), is superimposed on the same residues from the high-resolution X-ray structure (PDB ID 1UBQ, green). The backbone RMSD is 0.58 Å.
Protein: Ubq α 25-31
RMSD (Hz): CαHα 0.32, NH 0.24
Alignment tensors (Syy, Szz): (23.66, 16.48), (53.25, 7.65)

Preliminary Results: Ubiquitin Strand
The conformation of residues 2-7 of the human ubiquitin beta-strand, computed using NH and CH RDCs in two media, is superimposed on the same residues from the high-resolution X-ray structure (PDB ID 1UBQ). The backbone RMSD is 1.151 Å.
Protein: Ubq beta 2-7
RMSD (Hz): CαHα, NH
Alignment tensors (Syy, Szz): (53.32, 4.83), (48.03, 14.32)

Conclusions
- A complete and exhaustive search over the space of all structures minimizing the RDC fit function seems feasible, thanks to an understanding of the structure of the solution.
- Possible and exciting extensions to more/different data
Funding: NIH
Thank you!

Comparison
Data requirements vs. Accuracy (Ubiquitin)
[Figure axis labels: Sparse, Accuracy]