Finding Largest Well-predicted Subset of Protein Structure Models

1
Finding Largest Well-predicted Subset of Protein
Structure Models
Shuai Cheng Li, Dongbo Bu, Jinbo Xu and Ming Li
University of Waterloo
2
8th Community Wide Experiment on the Critical
Assessment of Techniques for Protein Structure
Prediction
3
Life
Phenotype
4
Protein Structure Model Assessment
  • A number of protein structure prediction methods
    are available, and each method produces a large
    number of models.
  • Evaluating the quality of these models is a
    difficult and fundamental problem.
  • Root Mean Square Deviation (RMSD) is the most
    popular measure.
  • RMSD fails to identify the quality of a model
    when only a substructure is predicted correctly.
  • RMSD is also not comparable between targets of
    different lengths:
  • a model of 10 residues with an RMSD of 3 Å is
    considered bad, while
  • a model of 100 residues with an RMSD of 3 Å is
    considered accurate.
  • Intensive studies have been conducted on
    alternative measures:
  • MaxSub
  • Global Distance Test (GDT)
  • Local/Global Alignment (LGA)
  • TMscore, etc.
  • Most of the proposed methods are heuristic and
    have no theoretical performance bound.
  • RMSD(A, B) = $\min_T \sqrt{\frac{1}{n}\sum_{i=1}^{n} \|a_i - T(b_i)\|^2}$,
    where T ranges over rigid transformations.
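
To make the first RMSD weakness concrete, here is a minimal sketch that evaluates RMSD for coordinates that are assumed to be already superimposed (it does not search for the minimizing transformation). The coordinates are toy values chosen for illustration, not a real protein.

```python
import math

def rmsd(A, B):
    """RMSD between two equal-length lists of 3D points.

    Assumption: B has already been superimposed on A, so no
    minimization over rigid transformations is performed here.
    """
    if len(A) != len(B) or not A:
        raise ValueError("A and B must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(A, B))
    return math.sqrt(total / len(A))

# Toy model: nine of ten residues are predicted exactly,
# and a single residue is misplaced by 10 Å.
A = [(float(i), 0.0, 0.0) for i in range(10)]
B = list(A)
B[9] = (9.0, 10.0, 0.0)

value = rmsd(A, B)
print(round(value, 3))
```

A single badly placed residue dominates the score even though 90% of the model is exact, which is the situation a subset-based measure is meant to handle.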

5
Largest Well-Predicted Subset
  • The Largest Well-Predicted Subset (LWPS) measure
    has been proposed.
  • LWPS is elegant and intuitive.
  • It was believed to be NP-complete, and heuristic
    approaches have been proposed.
  • We found that it is actually solvable in $O(n^7)$
    time using techniques from computational geometry.

6
Outline
  • Preliminaries
  • A Discretization of the Rigid Transformation
  • A Randomized Algorithm for Globular Protein
    Structure
  • Approximating the Bottleneck Distance

7
Problem Definition
  • A protein structure A consists of an ordered set
    of n 3D points,
  • $A = (a_1, a_2, \ldots, a_n)$.
  • Each point represents a Cα atom.
  • A is bounded within a sphere of radius $R_A$.
  • $R_A = O(n)$ for general proteins
  • $R_A = c\,n^{1/3}$ for globular proteins
  • Distance constraints:
  • the distance between two non-consecutive points
    is no less than 4 Å.
  • the distance between any two consecutive points
    is about 3.8 Å.
  • the maximum number of points that can be
    enclosed in a given sphere of radius r is
    proportional to the volume of the sphere, i.e.
    $O(r^3)$.

8
Problem Definition
  • The predicted model B of the protein consists of
    an ordered set of n points,
  • $B = (b_1, b_2, \ldots, b_n)$.
  • B has the geometric properties of a protein
    structure.
  • Given a threshold d and a rigid transformation T,
  • if $\|a_i - T(b_i)\| < d$, we say $a_i$ matches
    $b_i$, or $b_i$ fits into $a_i$ under T.
  • The match set of B is the set of points of B
    that match the corresponding points of A.
  • When the context is clear, we simply refer to it
    as the match set.
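
The match-set definition can be sketched directly. In the snippet below, `match_set` and `translate` are hypothetical helper names, the transformation is passed in as a plain function, indices are 0-based (the slides use 1-based indices), and the coordinates are toy values.

```python
import math

def match_set(A, B, d, transform=lambda p: p):
    """Indices i for which b_i fits into a_i under the given transformation,
    i.e. the distance between a_i and transform(b_i) is below d."""
    return [i for i, (a, b) in enumerate(zip(A, B))
            if math.dist(a, transform(b)) < d]

def translate(v):
    """Return a rigid transformation that translates points by vector v."""
    return lambda p: tuple(x + y for x, y in zip(p, v))

# Toy structure and model: the first three model points are the
# structure shifted by 1 Å along x, the last one is far off.
A = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
B = [(1.0, 0.0, 0.0), (4.8, 0.0, 0.0), (8.6, 0.0, 0.0), (20.0, 0.0, 0.0)]

m_identity = match_set(A, B, 0.5)
m_shifted = match_set(A, B, 0.5, translate((-1.0, 0.0, 0.0)))
print(m_identity, m_shifted)
```

The size of the match set depends on the transformation chosen, which is why the problem asks for the transformation that maximizes it.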

9
Problem Definition
  • Given a protein structure A, a model B and a
    threshold d, the largest well-predicted subset
    problem, denoted LWPS(A, B, d), is to identify
  • a maximum set $M_{opt} \subseteq \{1, 2, \ldots, n\}$, and
  • a corresponding rigid transformation $T_{opt}$ (a
    rotation and a translation) under which $a_i$
    matches $b_i$ for every $i \in M_{opt}$ (see
    MaxSub, Lan03).
  • d is called the bottleneck distance.
  • Denote $A_{opt} = A_{M_{opt}}$ and $B_{opt} = B_{M_{opt}}$.
  • LWPS(A, B, d) is solvable in $O(n^7)$ time.

10
Approximation Versions
  • Distance approximation for LWPS(A, B, d):
  • Find a transformation T and a subset
    $B' \subseteq B$ such that
  • $|B'| \ge |B_{opt}|$, and
  • $\forall b_i \in B'$, $\|a_i - T(b_i)\| < (1+\epsilon)d$,
    for some constant $\epsilon$.
  • Bottleneck distance approximation:
  • Find a transformation T such that
  • $\forall b_i \in B$, $\|a_i - T(b_i)\| < (1+\epsilon)d_{opt}$.

11
Outline
  • Preliminaries
  • A Discretization of the Rigid Transformation
  • A Randomized Algorithm for Globular Protein
    Structure
  • Approximating the Bottleneck Distance

12
Rigid Transformation
  • A rigid transformation T on a point set P
    consists of a rotation and a translation.
  • Decompose the rigid transformation T into two
    steps:
  • (1) A transformation $T'$ that moves an
    arbitrarily chosen radial pair
    $\langle p_1, p_2 \rangle$ of P to its position
    under T, i.e. $T'(p_1) = T(p_1)$ and
    $T'(p_2) = T(p_2)$.
  • (2) A rotation R about the axis through $T(p_1)$
    and $T(p_2)$ such that
    $\forall p \in P$, $R(T'(p)) = T(p)$.
  • Approximating the rigid transformation T on P:
  • If a transformation $T'$ moves the radial pair
    $\langle p_1, p_2 \rangle$ to within $\epsilon$ of
    $T(p_1)$ and $T(p_2)$, then there exists a
    rotation R about the axis through $T'(p_1)$ and
    $T'(p_2)$ such that every point p of P ends up
    within $3\epsilon$ of $T(p)$.
  • Proof omitted.
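
The second step of the decomposition is a rotation about the line through two fixed points. A minimal sketch of such a rotation, using Rodrigues' rotation formula (a standard technique, not necessarily what the authors implemented), is shown below; `rotate_about_axis` is a hypothetical helper name.

```python
import math

def rotate_about_axis(p, q1, q2, theta):
    """Rotate point p by angle theta about the line through q1 and q2."""
    # Unit vector along the axis.
    k = [b - a for a, b in zip(q1, q2)]
    norm = math.sqrt(sum(x * x for x in k))
    if norm == 0.0:
        raise ValueError("q1 and q2 must be distinct")
    k = [x / norm for x in k]
    # Express p relative to a point on the axis.
    v = [x - a for x, a in zip(p, q1)]
    cross = (k[1] * v[2] - k[2] * v[1],
             k[2] * v[0] - k[0] * v[2],
             k[0] * v[1] - k[1] * v[0])
    dot = sum(ki * vi for ki, vi in zip(k, v))
    c, s = math.cos(theta), math.sin(theta)
    # Rodrigues' formula: v*cos + (k x v)*sin + k*(k.v)*(1 - cos).
    rotated = [v[i] * c + cross[i] * s + k[i] * dot * (1.0 - c)
               for i in range(3)]
    return tuple(r + a for r, a in zip(rotated, q1))

origin, z_tip = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
quarter_turn = rotate_about_axis((1.0, 0.0, 0.0), origin, z_tip, math.pi / 2)
on_axis = rotate_about_axis((0.0, 0.0, 5.0), origin, z_tip, 1.234)
```

Points on the axis stay fixed, so once the pair has been placed, the rotation angle is the only remaining degree of freedom.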

13
Matching a Radial Pair
  • Assume we know a radial pair
    $\langle b_i, b_j \rangle$ of the match set.
  • $b_i$ is in the d-sphere of $a_i$
  • $b_j$ is in the d-sphere of $a_j$
  • Discretize the d-sphere of $a_i$ with grid size
    $\epsilon d$.
  • Use the grid points as possible positions of
    $T(b_i)$.
  • In total, there are $O(1/\epsilon^3)$ possible
    positions for $T(b_i)$.
  • With $T(b_i)$ fixed, the possible positions of
    $T(b_j)$ lie on a sphere
  • centered at $T(b_i)$
  • with radius $\|b_i - b_j\|$.
  • Discretize the sphere surface with grid size
    $\epsilon d$;
  • there are $O(1/\epsilon^2)$ possible positions
    for $T(b_j)$.
  • In total, there are $O(1/\epsilon^5)$ possible
    positions for $\langle b_i, b_j \rangle$.
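
As a rough illustration of the first discretization step, the sketch below enumerates axis-aligned grid points with spacing εd inside the d-sphere of a point. The function name and the choice of a grid anchored at the sphere center are assumptions made for illustration only.

```python
import math

def grid_points_in_ball(center, d, eps):
    """Grid points with spacing eps*d that lie within distance d of center."""
    step = eps * d
    m = int(1.0 / eps + 1e-9)  # grid steps needed to reach the sphere boundary
    points = []
    for i in range(-m, m + 1):
        for j in range(-m, m + 1):
            for k in range(-m, m + 1):
                # Keep only offsets inside the ball (small tolerance for rounding).
                if (i * i + j * j + k * k) * step * step <= d * d * (1 + 1e-12):
                    points.append((center[0] + i * step,
                                   center[1] + j * step,
                                   center[2] + k * step))
    return points

center = (1.0, 2.0, 3.0)
coarse = grid_points_in_ball(center, 2.0, 0.5)
fine = grid_points_in_ball(center, 2.0, 0.25)
print(len(coarse), len(fine))
```

Halving ε multiplies the number of candidate positions by roughly eight, in line with the O(1/ε³) count for a single point.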

14
Exact Algorithm for Given Rotation Axis
  • Problem: B may only be rotated about a given
    axis; find an angle $\theta \in [0, 2\pi)$ that
    maximizes the number of matched pairs.
  • Represent $[0, 2\pi)$ as a unit circle.
  • The set of angles that move $b_i$ into the
    d-sphere of $a_i$ is an arc.
  • With a plane sweep, we can find a point on the
    circle (corresponding to an angle) that is
    contained in the maximum number of arcs in
    $O(n \log n)$ time.
  • Approximating the LWPS problem:
  • For each pair of points of B,
  • try to match the pair approximately to the
    corresponding pair of A; there are
    $O(1/\epsilon^5)$ choices for each pair.
  • Rotate B about the axis through the pair to find
    a maximum match.
  • The running time is $O(n^3 \log n / \epsilon^5)$.
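
The sweep over the circle can be sketched as follows, assuming the arcs have already been computed and are given as closed intervals (start, end) of angles in [0, 2π), traversed counterclockwise, with start > end meaning the arc wraps past 0. Computing the arcs from the geometry is not shown, and `best_angle` is a hypothetical name.

```python
import math

TWO_PI = 2.0 * math.pi

def best_angle(arcs):
    """Return (angle, count) for an angle covered by the most arcs."""
    events = []
    for start, end in arcs:
        # Split an arc that wraps past 0 into two non-wrapping pieces.
        pieces = [(start, end)] if start <= end else [(start, TWO_PI), (0.0, end)]
        for s, e in pieces:
            events.append((s, 0))  # 0 = arc opens; sorts before a close at the same angle
            events.append((e, 1))  # 1 = arc closes
    events.sort()  # O(n log n); the sweep itself is linear
    best_count, best_theta, active = 0, 0.0, 0
    for theta, kind in events:
        if kind == 0:
            active += 1
            if active > best_count:
                best_count, best_theta = active, theta
        else:
            active -= 1
    return best_theta, best_count

arcs = [(0.0, 1.0), (0.5, 2.0), (6.0, 0.7), (3.0, 4.0)]
angle, count = best_angle(arcs)
print(angle, count)
```

Each arc corresponds to one point of B, so the returned count is the number of matched pairs at the best rotation angle for this axis.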

15
Outline
  • Preliminaries
  • A Discretization of the Rigid Transformation
  • A Randomized Algorithm for Globular Protein
    Structure
  • Approximating the Bottleneck Distance

16
Basic Idea
  • Algorithm LWPS is inefficient.
  • Time complexity: $O(n^3 \log n / \epsilon^5)$
  • Observation:
  • Given a radial pair of $B_{opt}$, we can solve
    LWPS in $O(n \log n / \epsilon^5)$ time.
  • If the probability that a pair is a radial pair
    is high, we can solve the problem with high
    probability by trying only a few pairs instead of
    trying all possible pairs.

17
Two Key Facts
  • Pseudo radial pair:
  • A pair $\langle b_i, b_j \rangle$ is a pseudo
    radial pair if
    $\|b_i - b_j\| > (\tfrac{1}{2}\alpha n)^{1/3}$,
  • where $\alpha$ is some constant.
  • Fact 1: A pseudo radial pair introduces only a
    small error.
  • By using a finer grid, the constant factor can
    be absorbed.
  • Fact 2: There exist enough pseudo radial pairs.
  • $B_{M_{opt}}$ contains at least
    $\tfrac{1}{2}|M_{opt}|^2$ pseudo radial pairs.
  • Proof omitted.

18
Randomized Algorithm
  • Randomly sample $(1/\alpha^2)\log n$ pairs from B.
  • There are at least $\tfrac{1}{2}(\alpha n)^2$
    pseudo radial pairs, so
  • the probability of sampling at least one pseudo
    radial pair is $1 - O(1/n)$.
  • Time complexity:
  • Each pair takes $O(n \log n / \epsilon^5)$ time.
  • $O(\log n)$ pairs are tried.
  • Total time: $O(n (\log n)^2 / \epsilon^5)$.
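
The sampling bound can be checked numerically. The sketch below assumes pairs are drawn independently and uniformly, and that each draw is a pseudo radial pair with probability at least α² (about ½(αn)² good pairs out of roughly ½n² pairs); both are simplifying assumptions, and `failure_bound` is a hypothetical name.

```python
import math

def failure_bound(n, alpha):
    """Number of samples k and an upper bound on the probability that
    none of the k sampled pairs is a pseudo radial pair."""
    k = math.ceil(math.log(n) / alpha ** 2)
    # (1 - alpha^2)^k <= exp(-alpha^2 * k) <= exp(-ln n) = 1/n
    return k, (1.0 - alpha ** 2) ** k

for n in (10, 100, 1000):
    k, bound = failure_bound(n, 0.5)
    print(n, k, bound)
```

Under these assumptions the failure probability is at most 1/n, so a logarithmic number of sampled pairs is enough.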

19
Outline
  • Preliminaries
  • A Discretization of the Rigid Transformation
  • A Randomized Algorithm for Globular Protein
    Structure
  • Approximating the Bottleneck Distance

20
Approximating the Bottleneck Distance
  • Goal: compute the minimum distance d
  • such that every point $b_i$ in B fits into the
    d-sphere of $a_i$.
  • A binary search approach is used here.

21
Bound the Bottleneck Distance
  • Let $D = \mathrm{RMSD}(A, B)$; then
  • $D \le d_{opt} \le \sqrt{n}\,D$.
  • Proof omitted.
  • RMSD can be computed in $O(n)$ time.
  • Given the interval $[D, \sqrt{n}\,D]$:
  • Subdivide it into subintervals $[D, 2D]$,
    $(2D, 4D]$, $(4D, 8D]$, ..., $(2^k D, 2^{k+1} D]$.
  • There are $O(\log n)$ subintervals in total, so a
    binary search over them takes $O(\log\log n)$
    steps.
  • Use binary search to bound $d_{opt}$ within one
    of the subintervals.
  • Assume $d_{opt} \in (2^k D, 2^{k+1} D]$ for some k.
  • Subdivide $(2^k D, 2^{k+1} D]$ into subintervals
    of size $0.5\,\epsilon\,2^k D$.
  • There are $O(1/\epsilon)$ such subintervals in
    total.
  • Apply binary search again to bound $d_{opt}$
    within one of these subintervals.
  • The total time complexity is
    $O(n(\log\log n + \log(1/\epsilon))/\epsilon^5)$.
  • Each binary search step takes $O(n/\epsilon^5)$
    time.
  • Since we want to match all the points in B, a
    radial pair can be precomputed.
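
The two-stage search can be sketched as below. The oracle `feasible(d)` is hypothetical: it stands in for the matching procedure and is assumed here to be exact and monotone (true exactly when every point of B can be matched within distance d), whereas the procedure in the slides is itself approximate.

```python
import math

def approx_bottleneck(D, n, eps, feasible):
    """Return a distance r with d_opt <= r <= (1 + eps) * d_opt,
    assuming D <= d_opt <= sqrt(n) * D and an exact monotone oracle."""
    if D == 0.0 or feasible(D):
        return D  # d_opt >= D, so feasibility at D means d_opt == D
    # Stage 1: binary search for the smallest k with feasible(2^k * D).
    lo, hi = 1, max(1, math.ceil(math.log2(math.sqrt(n))))
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(D * 2 ** mid):
            hi = mid
        else:
            lo = mid + 1
    base = D * 2 ** (lo - 1)  # now base < d_opt <= 2 * base
    # Stage 2: binary search over steps of size 0.5 * eps * base in (base, 2*base].
    step = 0.5 * eps * base
    lo2, hi2 = 1, math.ceil(base / step)
    while lo2 < hi2:
        mid = (lo2 + hi2) // 2
        if feasible(base + mid * step):
            hi2 = mid
        else:
            lo2 = mid + 1
    return base + lo2 * step

true_d_opt = 3.73  # toy value used only to build the oracle
result = approx_bottleneck(1.0, 100, 0.1, lambda d: d >= true_d_opt)
print(result)
```

The first loop makes O(log log n) oracle calls and the second O(log(1/ε)), which matches the two terms in the stated complexity.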

22
Experimental Results
23
  • TMscore and MaxSub are two popular packages for
    protein structure comparison.
  • We compare ApproxSub against MaxSub and TMscore
    in terms of the number of matched pairs found.
  • The test set includes 1fc2, 1enh, 2gb1, 2cro,
    1ctf, and 4icb. These six proteins are commonly
    used for testing protein structure prediction
    methods.

24
  • X: the ratio of matched pairs found by MaxSub
    (TMscore)
  • Y: the ratio of matched pairs found by ApproxSub
  • Observation:
  • ApproxSub can find more matched pairs than MaxSub
    and TMscore.
  • MaxSub and TMscore perform poorly at finding
    matched pairs when the ratio is greater than 0.60.
  • This is due to the heuristic nature of these two
    methods: they extend matches by first
    superimposing a local fragment of A and B, which
    may trap them at a local minimum.

25
  • Running time
  • X: number of residues in the protein
  • Y: CPU time (sec)
  • Observation:
  • ApproxSub is much faster than TMscore.

26
  • Q and A