Protein Structure Comparison presentation

1
Protein Structure Comparison
2
Protein Structure Alignment
Human hemoglobin alpha chain (PDB 1jeb, chain A)
Human myoglobin (PDB 2mm1)
Another example: G-proteins 1c1yA and
1kk1A (residues 6-200). Sequence identity 18%, structural identity 72%
3
Protein Structure Comparison: Motivation

Understanding of the folding process.
Protein classification
Finding binding sites of the protein
Identifying structurally conserved regions in the
protein

4
Comparison of proteins by sequence or shape?

Protein sequence comparison is simpler
3D structures are available for only a few percent of
all proteins
Shape similarity is detectable even though the
sequences may have changed in the course of
evolution

5
Different Instances of Structure Comparison

All-to-All comparison
Classify all known structures
Search for a structural motif
Study interaction between structures and other
molecules (Protein Docking)
Use known structures to predict structure from
sequence (Protein Threading)

6
Sequence order dependence

Sequence-order-dependent alignment:
a 1-D task.
Sequence-order-independent alignment:
a real 3-D task.

7
Why sequence order dependence?

Substructures preserving sequence order might be
biologically more meaningful
With the sequence order constraint the
computational task is simpler

8
The problem

Given a pair of protein structures, find the
correspondences between the Cα atoms of the
backbones that best align the two structures
Trade-off between maximizing the number of
corresponding atoms and minimizing the distance between them

9
Finding Correspondences

Point-based approaches
Geometric Hashing, Indexing (Wolfson et al.,
1998)
Comparison based on 2D or 3D distance matrices
(Holm, Sander, 96)
Dynamic Programming (Gerstein, Levitt, 98)
Combinatorial Extension (Bourne et al, 96)

10
Algorithms for Structure Alignment

Distance-based methods
DALI (Holm and Sander): aligning scalar distance
plots
STRUCTAL (Gerstein and Levitt): dynamic
programming using pairwise inter-molecular
distances
SSAP (Orengo and Taylor): dynamic programming
using intra-molecular vector distances
Vector-based methods
VAST (Bryant): graph-theory-based secondary
structure alignment
3dSearch (Singh and Brutlag): fast secondary
structure index lookup
Both vector- and distance-based
LOCK (Singh and Brutlag): hierarchically uses
both secondary structure vectors and atomic
distances

11
Hashing function

From an Object
To invariant Features
To a t-tuple of numbers
To indices
Use the t indices to access a t-dimensional hash
table

12
Indexing Methods for Fast Retrieval of 3D
Patterns

Select a set of target proteins
Create and store a hash table indexed by
invariant geometric properties of the selected
folds
Update the databases as new structures are found
Use the table to identify the nearest fold for a
target protein.

13
Reference Frame

A 3-D reference frame (r.f.) can be defined by
three non-collinear points
Invariant:
the coordinates of any other point expressed in the r.f.
14
Secondary Structures Representation

Secondary structures are represented as linear
vectors (segments):
the axis for an α-helix and the best-fit
segment for a β-strand
An SVD-based alignment algorithm matches each
α-helix segment against helices with known axes to
determine the helix axis. β-strands are fitted
directly with line segments.
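
A direct segment fit for a strand can be sketched as a least-squares line through the Cα coordinates. This sketch assumes the principal axis of the point cloud is a good stand-in for the strand direction, and it uses power iteration on the 3x3 scatter matrix in place of a full SVD so that it needs only the standard library. The name `fit_segment` is hypothetical.

```python
import math

def fit_segment(points, iterations=100):
    """Least-squares line fit; returns the (start, end) points of the segment."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - c[k] for k in range(3)] for p in points]
    # 3x3 scatter matrix of the centered coordinates.
    cov = [[sum(q[i] * q[j] for q in centered) for j in range(3)]
           for i in range(3)]
    # Start from the first-to-last direction so the segment keeps chain order.
    v = [points[-1][k] - points[0][k] for k in range(3)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project the points on the axis; the extremes give the segment endpoints.
    proj = [sum(q[k] * v[k] for k in range(3)) for q in centered]
    lo, hi = min(proj), max(proj)
    start = tuple(c[k] + lo * v[k] for k in range(3))
    end = tuple(c[k] + hi * v[k] for k in range(3))
    return start, end
```

The sketch fails if the first and last points coincide, since the starting direction is then the zero vector.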

15
Visualization

Each segment associated to a secondary structure
is displayed as a cylinder

16
Secondary Structure-based Approaches

Geometric Hashing, Indexing (Wolfson et al, Holm
et al, Guerra et al )
Graph-based (Grindley et al)
Dynamic Programming (Singh, Brutlag)

17
Indexing techniques based on Secondary
Structures (Guerra et al)

Consider all the triplets of secondary structures
and their associated segments
Construct a 3D table indexed by the angles
relating three secondary structures.

18
Table Construction

For each triplet a1, a2, a3 of secondary
structures of protein P,
compute the angles between
(a1, a2), (a1, a3), and (a2, a3),
and use them as indices to an entry in the a-a-a
table, where (P, (a1, a2, a3)) is stored.
At the end, each cell of the table contains
information about all triplets that hashed into
it (including distances between secondary
structures)
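
The table construction described on this slide might be sketched as follows. The sketch assumes each secondary structure is given as a direction vector, uses 9-degree bins (20 bins per angle, an assumption chosen to match the 20x20x20 table size quoted later in the deck), and omits the distances that the real cells also store. All names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

BIN_DEG = 9.0  # assumed bin width: 180 / 9 = 20 bins per angle

def angle_deg(u, v):
    """Angle between two direction vectors, in degrees."""
    d = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    c = max(-1.0, min(1.0, d / (nu * nv)))  # guard against rounding
    return math.degrees(math.acos(c))

def bin_index(angle):
    """Quantize an angle in [0, 180] to a bin index in [0, 19]."""
    return min(int(angle // BIN_DEG), int(180 / BIN_DEG) - 1)

def triplet_key(v1, v2, v3):
    """Table index: the three quantized pairwise angles."""
    return (bin_index(angle_deg(v1, v2)),
            bin_index(angle_deg(v1, v3)),
            bin_index(angle_deg(v2, v3)))

def build_table(proteins):
    """proteins: dict mapping a name to a list of direction vectors."""
    table = defaultdict(list)
    for name, vectors in proteins.items():
        for i, j, k in combinations(range(len(vectors)), 3):
            key = triplet_key(vectors[i], vectors[j], vectors[k])
            table[key].append((name, (i, j, k)))
    return table
```

Because the key depends only on angles between vectors, it does not change when the whole protein is rotated or translated.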

19
Table construction

Time complexity: O(s^3 n)
s is the number of secondary structures in a
protein
n is the number of proteins.

20
Searching the table

For a query protein, compute the same invariants
used for the target proteins.
For each invariant and its corresponding indices,
access the corresponding cell in the table and
cast a vote for the proteins stored there
List the target proteins according to their votes

21
Distribution of table entries (D. Platt, C.
Guerra, I. Rigoutsos, G. Zanotti, 2003)

There is a strong preference for triplets to fall
into cells with indexes α, β, γ satisfying
α + β = γ,
corresponding to segments lying on parallel planes

22
Analysis of Distribution of globally selected
secondary structures

Distributions show much stronger preference for
alignment than expected for randomly uniform
vectors.
There is a greater preference for alignment
between any two secondary structure elements if a
third structure element aligns with either of the
first two -- the alignment angles are not
independent variates.

26
Geometric Indexing
[Flowchart; box labels: PDB (contains 27,000 proteins), Hash Table, Query protein, Search for Similarity, Hypotheses of Similarity, List of Similar Proteins, Alignment by Dynamic Programming, Pairwise atomic Superimposition with a selected protein, Proteins Superposition]
27
Refinement of the matching procedure

Alignment
Find a collection of corresponding pairs of
secondary structures (SS) which maximizes a
given similarity measure
Dynamic programming

28
D(i,j) = number of times SS i of protein A is
associated with SS j of protein B in a triplet of
equivalent SS
29
Problem

Find the increasing path in the D(i,j) matrix
that maximizes the total similarity measure

30
Dynamic Programming

Let M be a 2D matrix such that M(i,j) is the
similarity measure between s1 s2 ... si and
t1 t2 ... tj
Compute
M(i,j) = max { M(i-1,j) + d(si, φ),
M(i-1,j-1) + d(si, tj),
M(i,j-1) + d(φ, tj) }
where φ denotes a gap
The solution
D(A,B) = M(n,m)
Quadratic time complexity
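
The recurrence can be sketched in a few lines. This sketch assumes the pairwise similarities are given as a matrix `sim` (for instance the vote counts D(i,j) from the earlier slide) and that every gap costs the same constant `gap`; both the function name and the constant gap score are assumptions.

```python
def align(sim, gap=0.0):
    """Order-preserving alignment; returns (score, list of matched index pairs)."""
    n, m = len(sim), len(sim[0])
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j] + gap,                    # si against a gap
                          M[i - 1][j - 1] + sim[i - 1][j - 1],  # si against tj
                          M[i][j - 1] + gap)                    # tj against a gap
    # Trace back from M[n][m] to recover the increasing path.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if M[i][j] == M[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return M[n][m], pairs
```

Filling the table takes time proportional to n times m, which is the quadratic cost mentioned on the slide.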

31
Integration of matching strategies

Using different protein representations at
atomic level
secondary structure level
sequence level

32
Superposition

Find a rigid transformation which optimally
superimposes the atoms of two proteins
Horn method
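
A sketch of Horn's closed-form quaternion method for paired points follows. It assumes the correspondences are already known (P[i] matches Q[i]) and, to stay within the standard library, finds the dominant eigenvector of Horn's 4x4 matrix by shifted power iteration rather than a library eigensolver. The function names are hypothetical.

```python
import math

def centroid(pts):
    n = len(pts)
    return [sum(p[k] for p in pts) / n for k in range(3)]

def horn_superpose(P, Q, iterations=200):
    """Return (R, t) minimizing the sum of |R p + t - q|^2 over paired points."""
    cp, cq = centroid(P), centroid(Q)
    A = [[p[k] - cp[k] for k in range(3)] for p in P]
    B = [[q[k] - cq[k] for k in range(3)] for q in Q]
    # Cross-covariance: S[i][j] = sum over pairs of A_i * B_j.
    S = [[sum(a[i] * b[j] for a, b in zip(A, B)) for j in range(3)]
         for i in range(3)]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    N = [
        [Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
        [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
        [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
        [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz],
    ]
    # Shift by the Frobenius norm so the largest eigenvalue dominates in magnitude.
    shift = math.sqrt(sum(x * x for row in N for x in row))
    q = [0.5, 0.4, 0.3, 0.2]  # generic start vector
    for _ in range(iterations):
        w = [sum(N[i][j] * q[j] for j in range(4)) + shift * q[i]
             for i in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        q = [x / norm for x in w]
    q0, qx, qy, qz = q
    # Rotation matrix of the unit quaternion (q0, qx, qy, qz).
    R = [
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
        [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
        [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz],
    ]
    t = [cq[i] - sum(R[i][k] * cp[k] for k in range(3)) for i in range(3)]
    return R, t

def apply(R, t, p):
    return [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]
```

Power iteration converges slowly when the point set is nearly collinear, because the best rotation is then close to ambiguous; a production implementation would use a proper eigensolver.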

33
Accuracy vs coverage

Accuracy: how many of the solutions found were
correct?
A = |F ∩ T| / |F|
Coverage: how many of the correct solutions were
found?
C = |F ∩ T| / |T|

T = correct solutions
F = solutions found
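
The two ratios can be computed directly from sets. This is a minimal sketch; the function name and the identifiers in the usage note are made up for illustration.

```python
def accuracy_coverage(found, correct):
    """Return (accuracy, coverage) for a set of found and a set of correct solutions."""
    found, correct = set(found), set(correct)
    hits = len(found & correct)
    return hits / len(found), hits / len(correct)
```

If a server returns four hypothetical hits `{"a", "b", "c", "d"}` and the reference classification lists `{"b", "c", "e"}`, two hits are correct, so accuracy is 2/4 and coverage is 2/3.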
34
Evaluating PROuST

Using SCOP as the standard of truth
Against other existing servers

35
Accuracy of results for 1tim chain A
36
Another Algorithm: DALI

Based on aligning 2-D intra-molecular distance
matrices
Computes the best subset of corresponding
residues from the two proteins such that
similarity between the 2-D distance matrices is
maximized.
Searches through all possible alignments of
residues using Monte-Carlo algorithms

37
DALI
38
Distance matrix (2)

Advantages
- invariant with respect to rotation and
translation
- can be used to compare proteins
Disadvantages
- the distance matrix is O(n^2) for a protein
with n residues
- comparing distance matrices is a hard problem
- insensitive to chirality
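
The chirality point can be checked with a small sketch: an intra-molecular distance matrix is unchanged by a mirror reflection, so a structure and its mirror image cannot be told apart. The coordinates in the usage note are made up and the function name is hypothetical.

```python
import math

def distance_matrix(coords):
    """All pairwise Euclidean distances between the given points."""
    return [[math.dist(p, q) for q in coords] for p in coords]
```

For instance, taking `coords = [(0, 0, 0), (1.5, 0, 0), (1.5, 2, 0), (0, 2, 1)]` and the mirrored set with every x negated, `distance_matrix` returns the same values for both, although no rotation maps one set onto the other.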

39
Distance matrix
[Figure: four points numbered 1-4 with example pairwise distances 5.9, 8.1 and 6.0]
40
DALI

DALI has been used to do an ALL vs. ALL
comparison of proteins in the PDB, and to create
a hierarchical clustering of families.
FSSP: Fold classification based on
Structure-Structure alignment of Proteins
http://www.ebi.ac.uk/dali/fssp/fssp.html

41
VAST: Vector Alignment Search Tool

Aligns only secondary structure elements (SSE)
Represents each SSE as a vector
Finds all possible pairs of vectors from the two
structures that are similar
Uses graph-theory algorithms to find a maximal
subset of similar vectors
The overall alignment score is based on the number
of similar pairs of vectors between the two
structures.

42
VAST

VAST has been used to do an ALL vs. ALL
comparison of proteins in the MMDB (NCBI's
structure database), and to find structure
neighbors for each structure.
MMDB provides a service for searching structure
neighbors using VAST.
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

43
LOCK

Define local secondary structures
Find an initial superposition by using DP to
align secondary structure vectors.
Use greedy algorithms to find nearest neighbors
and minimize RMSD between the Cα atoms from
query and target.
Find the core of aligned Cα atoms and minimize
RMSD between them.

44
Comparison of methods
45
Execution times for comparing a query structure
to 27,000 target structures
46
Execution times for comparing a query structure
to 685 target structures
47
Data

30,000 proteins extracted from PDB
Approx. 27,000 proteins inserted in the geometric
database
0 to 528 segments per protein
13.5 segments on average
48,000,000 triplets
4x20x20x20 table

48
GEOMETRIC PATTERN MATCHING UNDER RIGID MOTION (C.
Guerra, V. Pascucci, 1999)

Problem 1. Find a transformation T, if it exists,
that brings A to within a given distance, say ε,
of B, i.e. H(T(A), B) ≤ ε
Problem 2. Find the minimum Hausdorff distance
under a rigid motion:
D(A, B) = min_t H(t(A), B)
where t is a rigid motion

49
Hausdorff Distance

Let A = {a1, a2, ..., am} and B = {b1, b2, ..., bn}
be sets of either points or segments.
Definition (Hausdorff distance).
H(A, B) = max(h(A, B), h(B, A))
where the one-way Hausdorff distance is
h(A, B) = max_a min_b ρ(a, b)
where a (b) is a point of A (B) and ρ(a, b) is
a metric.
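
For finite point sets the definition translates directly into code. This is a brute-force sketch that assumes the metric is the Euclidean distance; the function names are hypothetical.

```python
import math

def one_way(A, B):
    """h(A, B): the largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """H(A, B) = max(h(A, B), h(B, A))."""
    return max(one_way(A, B), one_way(B, A))
```

The one-way distance is not symmetric: with A = [(0, 0), (1, 0)] and B = [(0, 0), (4, 0)], h(A, B) is 1 while h(B, A) is 3, so H(A, B) is 3.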

50
Segment Hausdorff distance

HS(A, B) = max(hS(A, B), hS(B, A))
where
hS(A, B) = max_ai (min_bj H(ai, bj))

51
Oriented Segment Hausdorff distance

HOS(A, B) = max(hOS(A, B), hOS(B, A)),
where
hOS(A, B) = max_ai
(min_bj (max(d(ais, bjs), d(aie, bje))))
and ais, aie are the start and end points of ai
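
The oriented variant compares start point with start point and end point with end point. A brute-force sketch, assuming each segment is a (start, end) pair of coordinate tuples and d is the Euclidean distance (function names are hypothetical):

```python
import math

def oriented_dist(a, b):
    """max(d(a_s, b_s), d(a_e, b_e)) for two oriented segments."""
    (a_s, a_e), (b_s, b_e) = a, b
    return max(math.dist(a_s, b_s), math.dist(a_e, b_e))

def one_way_os(A, B):
    """hOS(A, B): worst case over A of the best oriented match in B."""
    return max(min(oriented_dist(a, b) for b in B) for a in A)

def hausdorff_os(A, B):
    """HOS(A, B) = max(hOS(A, B), hOS(B, A))."""
    return max(one_way_os(A, B), one_way_os(B, A))
```

Reversing a segment changes the result: a unit segment and its reversed copy are at oriented distance 1, not 0.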

52
Exact solution in 2D

This problem is generally solved as a problem of
intersection of unions of disks in the
transformation space.
Time complexity: O(m^3 n^3 log^2(nm)) in R^2

53
The Matching algorithm

Find a rigid body transformation (translation
plus rotation) that minimizes the Hausdorff
distance between the segments of A and B.
Derive
T: A → B
based on three representative segments of A and
all triplets of segments of B, and choose the best
T.

54
Practical Approach

1. Select three "representative" segments a,
a', a'' of A as follows:
1.a Choose one representative a of A at random.
1.b Select a' to be the segment containing the
point a'f farthest from the midpoint ac of a.

55
Practical approach (contd)

1.c Select a'' as the segment that contains the
point at maximum distance from the line through ac and a'f.
2. For each triplet b, b', b'' of elements of B
determine the rotation and translation that maps
a, a' , a'' into b, b' , b''.
3. Choose the best transformation among the
examined ones.

56
Segment Nearest-Neighbor

Nearest-neighbor search among n segments in R^d is
equivalent to a query among 2n points in R^2d.
HSS(a, b) = min(max(d(as, bs), d(ae, be)),
max(d(as, be), d(ae, bs)))
Approximate nearest neighbor of a point q in R^d
(within a factor of (1 + ε)) (Arya et al.)
Time complexity O(log n) with O(n log n)
preprocessing.
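
The reduction from segments to points can be sketched by storing each segment twice, once per orientation, as a point with 2d coordinates. The sketch below uses a brute-force scan where the slide refers to the approximate nearest-neighbor structure of Arya et al., so it shows only the embedding and not the O(log n) query time. All names are hypothetical.

```python
import math

def embed(segments):
    """Map n segments in R^d to 2n labelled points in R^2d (both orientations)."""
    points = []
    for idx, (s, e) in enumerate(segments):
        points.append((idx, tuple(s) + tuple(e)))
        points.append((idx, tuple(e) + tuple(s)))
    return points

def block_max_dist(p, q):
    """max of the distances between the first halves and between the second halves."""
    d = len(p) // 2
    return max(math.dist(p[:d], q[:d]), math.dist(p[d:], q[d:]))

def nearest_segment(points, query):
    """Return (HSS distance, segment index) of the segment nearest to the query."""
    qs, qe = query
    q = tuple(qs) + tuple(qe)
    return min((block_max_dist(p, q), idx) for idx, p in points)
```

Taking the minimum over both stored orientations of a segment reproduces the min over the two endpoint pairings in HSS(a, b).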

57
Complexity Analysis

Time complexity: O(m n^3 log n)
Approximation error factor: 8

58
Protein 1rpa
62
Questions?
