Protein Structure Comparison

Protein Structure Alignment

Human Hemoglobin alpha-chain pdb1jebA

Human Myoglobin pdb2mm1

Another example: G-proteins 1c1yA and 1kk1A (residues 6-200)

Sequence identity 18%, structural identity 72%

Protein Structure Comparison: Motivation

- Understanding of the folding process.
- Protein classification
- Finding binding sites of the protein
- Identifying structurally conserved regions in the protein

Comparison of proteins by sequence or shape?

- Protein sequence comparison is simpler
- 3D structures are available for only a few percent of all proteins
- Shape similarity is detectable even though the sequences may have changed in the course of evolution

Different Instances of Structure Comparison

- All-to-All comparison
- Classify all known structures
- Search for a structural motif
- Study interaction between structures and other molecules (protein docking)
- Use known structures to predict structure from sequence (protein threading)

Sequence order dependence

- Sequence order dependent alignment
  - a 1-D task
- Sequence order independent alignment
  - a real 3-D task

Why Sequence order dependence?

- Substructures preserving sequence order might be biologically more meaningful
- With the sequence order constraint the computational task is simpler

The problem

- Given a pair of protein structures, find the correspondences between the Cα atoms of the backbone that best align the two structures
- There is a tradeoff between the number of corresponding atoms and the lowest distance

Finding Correspondences

- Point-based approaches
  - Geometric hashing, indexing (Wolfson et al., 1998)
  - Comparison based on 2D or 3D distance matrices (Holm, Sander, 96)
  - Dynamic programming (Gerstein, Levitt, 98)
  - Combinatorial extension (Bourne et al., 96)

Algorithms for Structure Alignment

- Distance based methods
  - DALI (Holm and Sander): aligning scalar distance plots
  - STRUCTAL (Gerstein and Levitt): dynamic programming using pairwise inter-molecular distances
  - SSAP (Orengo and Taylor): dynamic programming using intra-molecular vector distances
- Vector based methods
  - VAST (Bryant): graph theory based secondary structure alignment
  - 3dSearch (Singh and Brutlag): fast secondary structure index lookup
- Both vector and distance based
  - LOCK (Singh and Brutlag): hierarchically uses both secondary structure vectors and atomic distances

Hashing function

- From an object
- To invariant features
- To a t-tuple of numbers
- To indices
- Use the t indices to access a t-dimensional hash table

Indexing Methods for Fast Retrieval of 3D Patterns

- Select a set of target proteins
- Create and store a hash table indexed by invariant geometric properties of the selected folds
- Update the databases as new structures are found
- Use the table to identify the nearest fold for a target protein.

Reference Frame

- A 3-D reference frame (r.f.) can be defined by three non-collinear points
- Invariant
  - the coordinates of any other point in the r.f.
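The invariance claim can be checked with a small sketch. It builds an orthonormal frame from three non-collinear points and expresses a fourth point in it; the helper names (`frame`, `local_coords`, `rigid`) and the sample coordinates are illustrative assumptions, not taken from the slides.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    n = math.sqrt(dot(u, u))
    return tuple(a / n for a in u)

def frame(p0, p1, p2):
    """Orthonormal frame (origin, x, y, z) built from three non-collinear points."""
    x = unit(sub(p1, p0))
    z = unit(cross(x, sub(p2, p0)))
    y = cross(z, x)
    return p0, x, y, z

def local_coords(q, fr):
    """Coordinates of point q expressed in the frame fr."""
    origin, x, y, z = fr
    d = sub(q, origin)
    return (dot(d, x), dot(d, y), dot(d, z))

def rigid(p, angle, shift):
    """Example rigid motion (assumption): rotate about the z-axis, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    rotated = (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])
    return tuple(a + b for a, b in zip(rotated, shift))

basis = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
q = (1.0, 2.0, 3.0)
before = local_coords(q, frame(*basis))

shift = (3.0, -1.0, 2.0)
moved_basis = [rigid(p, 0.7, shift) for p in basis]
after = local_coords(rigid(q, 0.7, shift), frame(*moved_basis))
```

Here `before` and `after` agree up to rounding, which is the property a hashing scheme relies on when it uses such coordinates as indices.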

Secondary Structures Representation

- Secondary structures are represented as linear vectors (segments)
  - the axis for an alpha helix and the best-fit segment for a beta strand
- An SVD-based alignment algorithm is used to match α-helix segments against helices with known axes to determine the helix axis. Direct segment fits are made to β-sheet strands.

Visualization

- Each segment associated with a secondary structure is displayed as a cylinder

Secondary Structure-based Approaches

- Geometric hashing, indexing (Wolfson et al., Holm et al., Guerra et al.)
- Graph-based (Grindley et al.)
- Dynamic programming (Singh, Brutlag)

Indexing Techniques Based on Secondary Structures (Guerra et al.)

- Consider all the triplets of secondary structures and their associated segments
- Construct a 3D table indexed by the angles relating three secondary structures.

Table Construction

- For each triplet (a1, a2, a3) of secondary structures of protein P
  - compute the angles between (a1, a2), (a1, a3), and (a2, a3),
  - and use them as indices to an entry in the a-a-a table, where (P, (a1, a2, a3)) is stored.
- At the end, each cell of the table contains information about all triplets that hashed into it (including distances between secondary structures)

Table construction

- Time complexity: O(s^3 n)
  - s is the number of secondary structures in a protein
  - n is the number of proteins
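A minimal sketch of the table construction, under several assumptions: each secondary structure is reduced to a direction vector, angles are quantized into 20 bins of 9 degrees per axis, only triplets in sequence order (i < j < k) are enumerated, and the stored distances are omitted. The names `angle_deg`, `triplet_key`, `build_table` and the toy proteins `P1`, `P2` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

BINS = 20                 # assumed number of bins per angle axis
BIN_WIDTH = 180.0 / BINS  # 9 degrees per bin

def angle_deg(u, v):
    """Angle between two direction vectors, in degrees (0 to 180)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding
    return math.degrees(math.acos(cosine))

def triplet_key(v1, v2, v3):
    """Quantized angles for the pairs (v1, v2), (v1, v3), (v2, v3)."""
    angles = (angle_deg(v1, v2), angle_deg(v1, v3), angle_deg(v2, v3))
    return tuple(min(int(a / BIN_WIDTH), BINS - 1) for a in angles)

def build_table(proteins):
    """proteins maps a name to a list of segment direction vectors."""
    table = defaultdict(list)
    for name, segments in proteins.items():
        # C(s, 3) triplets per protein, hence O(s^3 n) overall.
        for i, j, k in combinations(range(len(segments)), 3):
            key = triplet_key(segments[i], segments[j], segments[k])
            table[key].append((name, (i, j, k)))
    return table

proteins = {
    "P1": [(1.0, 0.2, 0.0), (0.1, 1.0, 0.3), (0.0, 0.2, 1.0), (1.0, 1.0, 0.1)],
    "P2": [(1.0, 0.0, 0.1), (-1.0, 0.3, 0.0), (0.2, 0.1, 1.0)],
}
table = build_table(proteins)
total_entries = sum(len(cell) for cell in table.values())
```

With four segments in `P1` and three in `P2`, the table holds 4 + 1 = 5 triplet entries in total.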

Searching the table

- For a query protein, compute the same invariants used for the target proteins.
- For each invariant and its corresponding indices, access the corresponding cell in the table, where a vote is cast
- List the target proteins according to their votes
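The voting step might look like the following sketch. The table literal, the query keys, and the function name `rank_targets` are made up for illustration; in practice the keys would come from the same quantized invariants used to build the table.

```python
from collections import Counter

def rank_targets(table, query_keys):
    """Cast one vote per stored triplet found in each accessed cell."""
    votes = Counter()
    for key in query_keys:
        for protein, _triplet in table.get(key, ()):
            votes[protein] += 1
    # Target proteins listed by decreasing number of votes.
    return votes.most_common()

# Hypothetical table: cell index -> list of (protein, triplet) entries.
table = {
    (10, 10, 10): [("P1", (0, 1, 2)), ("P2", (1, 2, 3))],
    (2, 10, 9): [("P1", (0, 1, 3))],
    (5, 5, 5): [("P3", (0, 1, 2))],
}
# Hypothetical quantized invariants computed from a query protein.
query_keys = [(10, 10, 10), (2, 10, 9), (0, 0, 0)]
ranking = rank_targets(table, query_keys)
```

In this toy case `P1` collects two votes, `P2` one, and `P3` none, so only the first two appear in `ranking`.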

Distribution of Table Entries (D. Platt, C. Guerra, I. Rigoutsos, G. Zanotti, 2003)

- There is a strong preference for triplets to fall into cells with indices α, β, γ satisfying
  - α + β = γ
  - corresponding to segments lying on parallel planes

Analysis of Distribution of Globally Selected Secondary Structures

- Distributions show a much stronger preference for alignment than expected for uniformly random vectors.
- There is a greater preference for alignment between any two secondary structure elements if a third structure element aligns with either of the first two: the alignment angles are not independent variates.

Geometric Indexing

PDB contains 27000 proteins

Hash Table

Query protein

Proteins Superposition

Search for Similarity

Hypotheses of Similarity

List of Similar Proteins

Pairwise atomic superimposition with a selected protein

Alignment by Dynamic Programming

Refinement of the matching procedure

- Alignment
  - Find a collection of corresponding pairs of secondary structures (SS) which maximizes a given similarity measure
- Dynamic programming

D(i,j) = number of times SS i of protein A is associated with SS j of protein B in a triplet of equivalent SS

Problem

- Find the increasing path in the D(i,j) matrix that maximizes the total similarity measure

Dynamic Programming

- Let M be a 2D matrix such that M(i,j) is the similarity measure between s1 s2 ... si and t1 t2 ... tj
- Compute
  - M(i,j) = max { M(i-1,j) + d(si, φ), M(i-1,j-1) + d(si, tj), M(i,j-1) + d(φ, tj) }
  - where φ denotes a gap
- The solution
  - D(A,B) = M(n,m)
- Quadratic time complexity
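The recurrence can be sketched as a standard global-alignment table fill. Several details are assumptions made for the example: a single constant gap score stands in for both gap terms, the first row and column are initialized with accumulated gap scores, and the match function `toy_score` over helix/strand labels is invented.

```python
def align_score(s, t, d, gap):
    """Fill M so that M[i][j] scores the best alignment of s[:i] with t[:j]."""
    n, m = len(s), len(t)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + gap       # s[:i] aligned entirely to gaps
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + gap       # t[:j] aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                M[i - 1][j] + gap,                        # s_i against a gap
                M[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # s_i against t_j
                M[i][j - 1] + gap,                        # t_j against a gap
            )
    return M[n][m]

def toy_score(a, b):
    """Hypothetical similarity: +1 for equal labels, -1 otherwise."""
    return 1.0 if a == b else -1.0

# H = helix, E = strand; the labels are illustrative only.
score = align_score("HEHH", "HHH", toy_score, gap=-1.0)
```

The two nested loops visit each of the (n+1)(m+1) cells once, which gives the quadratic running time. Here the best alignment skips the `E` and matches the three helices, for a score of 3 - 1 = 2.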

Integration of matching strategies

- Using different protein representations at
- atomic level
- secondary structure level
- sequence level

Superposition

- Find a rigid transformation which optimally superimposes the atoms of two proteins
- Horn's method

Accuracy vs coverage

- Accuracy: how many of the solutions found were correct?
  - A = |F ∩ T| / |F|
- Coverage: how many of the correct solutions were found?
  - C = |F ∩ T| / |T|

T = correct solutions

F = solutions found
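As a small worked example, the two ratios can be computed with set operations. The identifiers in `found` and `correct` are placeholders, not real results.

```python
def accuracy_and_coverage(found, correct):
    """Return (|F ∩ T| / |F|, |F ∩ T| / |T|) for sets F = found, T = correct."""
    hits = len(found & correct)
    accuracy = hits / len(found) if found else 0.0
    coverage = hits / len(correct) if correct else 0.0
    return accuracy, coverage

found = {"fold_a", "fold_b", "fold_c", "fold_d"}   # F: solutions returned
correct = {"fold_b", "fold_c", "fold_e"}           # T: solutions in the standard of truth
accuracy, coverage = accuracy_and_coverage(found, correct)
```

Two of the four returned solutions are correct (accuracy 0.5), and two of the three correct solutions were returned (coverage 2/3).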

Evaluating PROuST

- Using SCOP as the standard of truth
- Against other existing servers

Accuracy of results for 1tim chain A

Another Algorithm: DALI

- Based on aligning 2-D intra-molecular distance matrices
- Computes the best subset of corresponding residues from the two proteins such that similarity between the 2-D distance matrices is maximized.
- Searches the space of possible residue alignments using Monte Carlo algorithms

DALI

Distance matrix (2)

- Advantages
  - invariant with respect to rotation and translation
  - can be used to compare proteins
- Disadvantages
  - the distance matrix is O(n^2) for a protein with n residues
  - comparing distance matrices is a hard problem
  - insensitive to chirality
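The invariance and the chirality blind spot can both be seen in a short sketch. The four points and the specific rigid motion are arbitrary choices for illustration; the helper names are not from the slides.

```python
import math

def distance_matrix(points):
    """n x n matrix of pairwise Euclidean distances (O(n^2) entries)."""
    return [[math.dist(p, q) for q in points] for p in points]

def rotate_z(p, angle):
    """Rotate a 3-D point about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

def close(m1, m2, tol=1e-9):
    """True when two matrices agree entry by entry within tol."""
    return all(abs(x - y) <= tol
               for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))

points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0), (0.0, 1.0, 3.0)]

# Rigid motion: rotation followed by translation.
moved = [tuple(c + s for c, s in zip(rotate_z(p, 0.9), (4.0, -2.0, 1.0)))
         for p in points]

# Mirror image: not a rigid motion, yet all distances are preserved.
mirrored = [(-x, y, z) for x, y, z in points]

same_after_motion = close(distance_matrix(points), distance_matrix(moved))
same_after_mirror = close(distance_matrix(points), distance_matrix(mirrored))
```

Both flags come out true: the matrix cannot tell a structure from its rotated copy, which is the advantage, nor from its mirror image, which is the chirality limitation.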

Distance matrix

(Figure: four points labeled 1 to 4, with example pairwise distances 5.9, 8.1, and 6.0)

DALI

- DALI has been used to do an all-vs-all comparison of proteins in the PDB, and to create a hierarchical clustering of families.
- FSSP: Fold classification based on Structure-Structure alignment of Proteins
- http://www.ebi.ac.uk/dali/fssp/fssp.html

VAST: Vector Alignment Search Tool

- Aligns only secondary structure elements (SSE)
- Represents each SSE as a vector
- Finds all possible pairs of vectors from the two structures that are similar
- Uses graph theory algorithms to find the maximal subset of similar vectors
- The overall alignment score is based on the number of similar pairs of vectors between the two structures.

VAST

- VAST has been used to do an all-vs-all comparison of proteins in the MMDB (NCBI's structure database), and to find structure neighbors for each structure.
- MMDB provides a service for searching structure neighbors using VAST.
- http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

LOCK

- Define local secondary structures
- Find an initial superposition by using DP to align secondary structure vectors.
- Use greedy algorithms to find nearest neighbors and minimize RMSD between the Cα atoms from query and target.
- Find the core of aligned Cα atoms and minimize RMSD between them.

Comparison of methods

Execution times for comparing a query structure to 27,000 target structures

Execution times for comparing a query structure to 685 target structures

Data

- 30,000 proteins extracted from PDB
- Approx. 27,000 proteins inserted in the geometric database
- 0 to 528 segments per protein
- 13.5 segments on average
- 48,000,000 triplets
- 4x20x20x20 table

GEOMETRIC PATTERN MATCHING UNDER RIGID MOTION (C. Guerra, V. Pascucci, 1999)

- Problem 1. Find a transformation T, if it exists, that brings A to within a given distance, say ε, of B, i.e. H(T(A), B) ≤ ε
- Problem 2. Find the minimum Hausdorff distance under a rigid motion
  - D(A, B) = min_t H(t(A), B)
  - where t is a rigid motion

Hausdorff Distance

- Let A = {a1, a2, ..., am} and B = {b1, b2, ..., bn} be sets of either points or segments.
- Definition (Hausdorff distance)
  - H(A, B) = max(h(A, B), h(B, A))
  - where the one-way Hausdorff distance is
  - h(A, B) = max_a min_b ρ(a, b)
  - where a (b) is a point of A (B) and ρ(a, b) is a metric.
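For finite point sets the definition translates directly into a brute-force computation. This sketch assumes the Euclidean metric for ρ and uses two made-up point sets; it runs in O(mn) time and is meant only to make the asymmetry of the one-way distance visible.

```python
import math

def one_way(A, B, rho=math.dist):
    """h(A, B): the largest distance from a point of A to its nearest point of B."""
    return max(min(rho(a, b) for b in B) for a in A)

def hausdorff(A, B, rho=math.dist):
    """H(A, B) = max(h(A, B), h(B, A))."""
    return max(one_way(A, B, rho), one_way(B, A, rho))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0)]

h_ab = one_way(A, B)   # every point of A is within 1 of B
h_ba = one_way(B, A)   # the point (4, 0) is 3 away from A
H = hausdorff(A, B)
```

Here h(A, B) = 1 while h(B, A) = 3, so H(A, B) = 3: the symmetric distance is driven by the worst-matched point of either set.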

Segment Hausdorff distance

- HS(A, B) = max(hS(A, B), hS(B, A))
- where
- hS(A, B) = max_ai (min_bj H(ai, bj))

Oriented Segment Hausdorff distance

- HOS(A, B) = max(hOS(A, B), hOS(B, A)),
- where
- hOS(A, B) = max_ai (min_bj (max(d(ais, bjs), d(aie, bje))))
- ais, aie are the start and end points of ai
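A brute-force sketch of the oriented segment distance, assuming each segment is stored as a (start, end) pair of coordinate tuples and d is the Euclidean distance. The example segments are invented to show the effect of orientation.

```python
import math

def oriented_dist(a, b):
    """max(d(a_s, b_s), d(a_e, b_e)) for segments given as (start, end)."""
    (a_s, a_e), (b_s, b_e) = a, b
    return max(math.dist(a_s, b_s), math.dist(a_e, b_e))

def one_way_os(A, B):
    """hOS(A, B): worst case over A of the best oriented match in B."""
    return max(min(oriented_dist(a, b) for b in B) for a in A)

def hausdorff_os(A, B):
    """HOS(A, B) = max(hOS(A, B), hOS(B, A))."""
    return max(one_way_os(A, B), one_way_os(B, A))

A = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))]
same = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))]
flipped = [((1.0, 0.0, 0.0), (0.0, 0.0, 0.0))]  # same segment, reversed direction

d_same = hausdorff_os(A, same)
d_flipped = hausdorff_os(A, flipped)
```

The identical segment is at distance 0, while the reversed copy is at distance 1 (the segment length), since start is compared with start and end with end.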

Exact solution in 2D

- This problem is generally solved as a problem of intersection of unions of disks in the transformation space.
- Time complexity: O(m^3 n^3 log^2(nm)) in R^2

The Matching algorithm

- Find a rigid body transformation (translation plus rotation) that minimizes the Hausdorff distance between the segments of A and B.
- Derive T: A → B based on three representative segments of A and all triplets of segments of B, and choose the best T.

Practical Approach

- 1. Select three "representative" segments a, a', a'' of A as follows:
  - 1.a Choose randomly one representative a for A.
  - 1.b Select a' to be the segment containing the point a'f farthest from the midpoint ac of a.

Practical Approach (cont'd)

- 1.c Select a'' as the segment that contains the point at maximum distance from the line (ac, a'f).
- 2. For each triplet b, b', b'' of elements of B, determine the rotation and translation that maps a, a', a'' into b, b', b''.
- 3. Choose the best transformation among the examined ones.

Segment Nearest-Neighbor

- The nearest-neighbor query among n segments in R^d is equivalent to a query among 2n points in R^(2d).
- HSS(a, b) = min(max(d(as, bs), d(ae, be)), max(d(as, be), d(ae, bs)))
- Approximate nearest neighbor of a point q in R^d (within a factor of (1 + ε)) (Arya et al.)
- Time complexity O(log n) with O(n log n) preprocessing.
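The reduction from segments to points can be illustrated with a brute-force sketch. Each segment in R^d becomes two points in R^(2d), one per orientation, and the query segment becomes a single point. A linear scan is used here in place of the approximate nearest-neighbor structure of Arya et al., so this shows the equivalence, not the logarithmic query time. Segment coordinates and function names are assumptions.

```python
import math

def block_dist(p, q, d):
    """Distance between points of R^(2d): the max over the two d-dimensional halves."""
    return max(math.dist(p[:d], q[:d]), math.dist(p[d:], q[d:]))

def hss(a, b):
    """Orientation-free segment distance: best of the two endpoint pairings."""
    (a_s, a_e), (b_s, b_e) = a, b
    return min(max(math.dist(a_s, b_s), math.dist(a_e, b_e)),
               max(math.dist(a_s, b_e), math.dist(a_e, b_s)))

def nearest_segment(query, segments):
    """Scan 2n points in R^(2d); return (distance, index of the nearest segment)."""
    d = len(query[0])
    q = query[0] + query[1]                  # query as one point of R^(2d)
    candidates = []
    for idx, (start, end) in enumerate(segments):
        candidates.append((block_dist(q, start + end, d), idx))  # given orientation
        candidates.append((block_dist(q, end + start, d), idx))  # reversed orientation
    return min(candidates)

segments = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ((5.0, 5.0, 5.0), (6.0, 5.0, 5.0)),
    ((1.0, 0.0, 0.5), (0.0, 0.0, 0.5)),      # close to the query, but reversed
]
query = ((0.0, 0.0, 0.4), (1.0, 0.0, 0.4))
best_dist, best_idx = nearest_segment(query, segments)
```

The reversed segment at index 2 is found as the nearest one, and the distance returned by the point query matches `hss(query, segments[best_idx])`.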

Complexity Analysis

- Time complexity: O(m n^3 log n)
- Approximation error: factor 8

Protein 1rpa

Questions?