Reductionism and Classification Require Detailed Comparison: Consider 3D Comparison

1
Reductionism and Classification Require Detailed
Comparison: Consider 3D Comparison
  • Pharm 201/Bioinformatics I
  • Philip E. Bourne
  • Department of Pharmacology, UCSD
  • Reading: Chapter 16, Structural Bioinformatics

2
Consider this Course a Workflow
Understand the experiment to understand
the errors
Data In
Understand the scope and complexity of the data
Understand how to best represent (model) the data
Understand the methods to physically
instantiate the model
From initial analysis understand how to
control data in
Recognize redundancy in the data
Classify the data
Visualize the data
Analyze the data
3
From Last Time
  • We established the complex relationship between
  • Sequence and Structure
  • Structure and Structure
  • Structure and Function
  • Today we analyze how the relationships between
    structure and structure are established

4
Agenda
  • Understand why structure comparison is important
  • Understand why it is not a solved problem
  • Understand the basics of the methods used to
    address the problem
  • Understand one method (CE) in more detail
  • Review an example where structure comparison has
    revealed new biological insights

5
Why Structure Comparison is Important
  • Reductionism needed to classify protein
    structures
  • Functional assignment and hopefully new biology
  • Alignment of predicted structure against
    structural templates
  • Establish improved sequence relationships not
    possible from sequence alone
  • Protein engineering

6
Distinctions: Structure Superposition and
Structure Comparison and Alignment are Different
  • Structure superposition assumes you already know
    which atoms to superimpose; it merely optimizes
    the fit for the atoms chosen (relatively simple)
  • Structure alignment must first determine which
    atoms to align (difficult). We are concerned with
    alignment

7
Distinctions: Pair-wise Alignments are Different
from Multiple Structure Alignments
  • Multiple structure alignment algorithms are rare
    and of questionable quality (see for example
    Nucleic Acids Research (2004), 32, W100-W103)
  • Multiple structure alignments should not be
    confused with multiple pair-wise alignments
  • Here we focus on single pair-wise comparison and
    alignment

8
Why is it Not a Solved Problem?
9
Current State of the Art
  • There are many papers published on this, but
    relatively few have code to download or Web sites
    from which to perform comparisons
  • All methods can identify obvious similarities
    between two structures
  • Remote similarities are detected by only a subset
    of methods, and different remote similarities are
    recognized by different methods
  • Good alignments are much harder to come by
  • Speed is a serious issue with some algorithms

10
Desirables
  • Biologically meaningful alignments, not just
    geometrically meaningful ones
  • Complete database of all alignments
  • Ability to apply to structures not in the PDB

11
Biological vs Geometric Alignments: Plastocyanin
versus Azurin (from Godzik 1996)
Maintain 9 of 10 interactions: RMSD 1.5 Å
Maintain 5 of 10 interactions: RMSD 0.5 Å
12
Literature Alignments - Flavodoxin vs CheY Protein
From Godzik (1996) Protein Science, 5, 1325-1338.
13
Understand the basics of the methods used to
address the problem
14
See also http://en.wikipedia.org/wiki/Structural_alignment_software
15
How to Compare Structures?
Input: Structure 1, Structure 2
1. Feature extraction, giving Structure description 1
and Structure description 2
2. Comparison algorithm, giving scores
3. Statistical significance, giving similarity and
classification
16
Components of Structure Alignment
  • 1. Structure description
  • Local geometry
  • Side chain contacts
  • Geometric hashing
  • Distance matrix (Dali, 1993)
  • Properties (secondary structure, hydrophobic
    clusters) (Comparer, 1990)
  • Secondary structure elements (VAST, 1996)
  • Distances within and between aligned fragment
    pairs (CE, 1998)
  • Contact map (Celera, 2004)
  • Geometry invariants (Jia et al, 2004)
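As an illustration of one structure description from the list above, the following minimal sketch builds an intra-molecular distance matrix from a list of coordinates (for example, one C-alpha position per residue). The function name `distance_matrix` and the toy coordinates are assumptions made for this example; this is not the Dali implementation.

```python
import math

def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances.

    coords is a list of (x, y, z) tuples, e.g. one C-alpha atom per residue.
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

# Toy coordinates (hypothetical, not taken from any PDB entry).
toy = [(0, 0, 0), (3, 4, 0), (3, 4, 12)]
d = distance_matrix(toy)

# The same toy structure translated by (1, 1, 1).
shifted = distance_matrix([(1, 1, 1), (4, 5, 1), (4, 5, 13)])
print(d == shifted)
```

A distance matrix does not change when the structure is translated or rotated, which is why it can be compared between two proteins without first superimposing them.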

17
Components of Structure Alignment
  • 2. Alignment algorithms
  • Monte Carlo (Dali, VAST)
  • Heuristics (CE)
  • Dynamic Programming (CE)
  • Probabilistic
  • Statistical significance

18
Components of Structure Alignment
2. Alignment algorithms
  • Input and output of the alignment algorithm
  • Input: two proteins
  • Output: an alignment and scores
  • Constraints
  • Minimize RMSD
  • Maximize the aligned length L
  • Minimize gaps
  • Dynamic programming, integer programming, Monte
    Carlo

3. Statistical significance
  • Levitt and Gerstein, PNAS, 1998
  • Random model and CE scoring function (Jia et al,
    2004)

19
Understand one method (CE) in more detail
  • I.N. Shindyalov and P.E. Bourne (1998) Protein
    Structure Alignment by Incremental Combinatorial
    Extension (CE) of the Optimal Path. Protein
    Engineering, 11(9), 739-747. (793 citations!)

20
Basic Approach
  • Compare octameric fragments: each matched pair is
    an aligned fragment pair (AFP), a local alignment
  • Stitch together AFPs
  • Find the optimal path through the AFPs
  • Optimize the alignment through dynamic
    programming
  • Measure the statistical significance of the
    alignment

21
Why This Approach? Alignment Space is Very Large
and Must be Constrained Without Losing
Meaningful Alignments
  • Similarity matrix S, where
  • S has size (nA - m) x (nB - m)
  • m = length of an AFP
  • nA = length of protein A, nB = length of protein B
  • This is very expensive to compute, so constraints
    are needed
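To give a feel for the numbers, the sketch below evaluates the (nA - m)(nB - m) size of the similarity matrix for two chains. The function name and the chain lengths of 300 and 250 residues are assumptions chosen only for illustration.

```python
def similarity_matrix_size(n_a, n_b, m=8):
    """Number of entries in the AFP similarity matrix, (nA - m) * (nB - m)."""
    if n_a <= m or n_b <= m:
        raise ValueError("both chains must be longer than the AFP length m")
    return (n_a - m) * (n_b - m)

# Hypothetical chain lengths of 300 and 250 residues.
size = similarity_matrix_size(300, 250)
print(size)  # 70664
```

Each entry is itself a comparison of two octameric fragments, and the number of ways to chain entries into a path grows far faster than the matrix itself, which is the motivation for constraining the search.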

23
Definition of the Alignment Path
  • pAi = the AFP's starting residue position in
    protein A at the ith position of the alignment
    path
  • m = longest continuous path (the AFP length),
    set to 8
  • One of the conditions (1)-(3) should be satisfied
    for 2 consecutive AFPs i and i+1 in the path
  • (1) Two consecutive AFPs aligned without gaps
  • (2) Two consecutive AFPs with a gap in protein A
  • (3) Two consecutive AFPs with a gap in protein B

24
Extension of the Alignment Path
Gap sizes are limited to G, heuristically set to
30 residues
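The three conditions on consecutive AFPs, together with the gap limit G, can be sketched as a small check. This is a minimal illustration that assumes AFPs are identified by their 0-based start positions in proteins A and B; the function name `classify_step` and its return labels are invented for this example and are not taken from the CE source code.

```python
def classify_step(pa_i, pb_i, pa_next, pb_next, m=8, max_gap=30):
    """Classify how AFP i+1 follows AFP i in the path.

    Returns "no gap", "gap in A", "gap in B", or None when the step
    satisfies none of the three conditions or exceeds the gap limit.
    """
    gap_a = pa_next - (pa_i + m)   # residues skipped in protein A
    gap_b = pb_next - (pb_i + m)   # residues skipped in protein B
    if gap_a == 0 and gap_b == 0:
        return "no gap"            # condition (1)
    if 0 < gap_a <= max_gap and gap_b == 0:
        return "gap in A"          # condition (2)
    if gap_a == 0 and 0 < gap_b <= max_gap:
        return "gap in B"          # condition (3)
    return None

print(classify_step(0, 0, 8, 8))    # no gap
print(classify_step(0, 0, 12, 8))   # gap in A
print(classify_step(0, 0, 8, 20))   # gap in B
print(classify_step(0, 0, 50, 8))   # None: gap of 42 exceeds G = 30
```

Note that a step with gaps in both proteins at once is rejected, since the conditions allow a gap in only one protein between consecutive AFPs.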
25
Evaluation Based Upon the Following Three
Distance Similarity Measures
1. Distance calculated from an independent set of
inter-residue distances, where each distance is
used only once - used for combinations of 2 AFPs
2. Full set of inter-residue distances - used for
a single AFP
3. RMSD from least squares superposition - used
to select the few best fragments
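The first two measures can be sketched on top of precomputed intra-molecular distance matrices. This is a hedged illustration, not the published CE formulas: the function names are invented, positions are 0-based, and the particular "independent set" used here (residue k of one fragment paired with residue m-1-k of the other, so that every residue contributes to exactly one distance) is an assumption; the exact pairing in the original paper may differ.

```python
def d_full(dA, dB, pa_i, pb_i, pa_j, pb_j, m=8):
    """Mean absolute difference over the full set of m*m inter-residue distances."""
    total = 0.0
    for k in range(m):
        for l in range(m):
            total += abs(dA[pa_i + k][pa_j + l] - dB[pb_i + k][pb_j + l])
    return total / (m * m)

def d_independent(dA, dB, pa_i, pb_i, pa_j, pb_j, m=8):
    """Mean absolute difference over m distances, each residue used once."""
    total = 0.0
    for k in range(m):
        l = m - 1 - k   # assumed pairing: first with last, second with second-last, ...
        total += abs(dA[pa_i + k][pa_j + l] - dB[pb_i + k][pb_j + l])
    return total / m

# Toy data: protein A is a straight line of points 1 unit apart,
# protein B is the same line stretched by a factor of 2.
dA = [[float(abs(i - j)) for j in range(6)] for i in range(6)]
dB = [[2.0 * abs(i - j) for j in range(6)] for i in range(6)]

same = d_full(dA, dA, 0, 0, 2, 2, m=2)
full = d_full(dA, dB, 0, 0, 2, 2, m=2)
indep = d_independent(dA, dB, 0, 0, 2, 2, m=2)
print(same, full, indep)
```

Passing the same AFP twice (i equal to j) to `d_full` evaluates a single AFP, matching the use noted in item 2. The independent set costs m distance lookups instead of m squared, which matters when many AFP combinations are evaluated.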
26
Evaluation Based Upon the Following Three
Distance Similarity Measures
1. Distance calculated from independent set of
inter-residue distances where each
distance is used only once
2. Full set of inter-residue distances
3. RMSD from least squares superposition
27
How to Extend the Path?
1. Consider all possible AFPs that extend the path
2. Consider only the best AFP
3. Use some intermediate strategy
28
How to Extend the Path?
1. Consider all possible AFPs that extend the
path: computationally expensive
2. Consider only the best AFP: works well
with the right heuristics
3. Use some intermediate strategy
29
What Heuristics?
Candidate AFPs are based upon condition (9), with
D0 = 3 Å
The best AFP is based upon condition (10), with
D1 = 4 Å
The decision to extend or terminate the path is
based upon condition (11)
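The "consider only the best AFP" strategy with the two thresholds can be sketched as a single greedy step. This is a simplified illustration under stated assumptions: `self_score` and `pair_score` are hypothetical callbacks standing in for the distance measures, the mean distance to the AFPs already in the path is used as the selection criterion, and the extend-or-terminate test is folded into the same threshold D1. The published conditions (9)-(11) are not reproduced here.

```python
def extend_path_greedily(path, candidates, self_score, pair_score,
                         d0=3.0, d1=4.0):
    """Try to extend a non-empty path by the single best candidate AFP.

    Returns the extended path, or None when the path should terminate.
    """
    if not path:
        raise ValueError("path must contain at least one AFP")

    # Keep only candidates that are good AFPs on their own (threshold D0).
    viable = [afp for afp in candidates if self_score(afp) < d0]
    if not viable:
        return None

    def mean_to_path(afp):
        return sum(pair_score(prev, afp) for prev in path) / len(path)

    # Greedy choice: the candidate closest, on average, to the existing path.
    best = min(viable, key=mean_to_path)
    if mean_to_path(best) >= d1:
        return None   # even the best candidate is too distant (threshold D1)
    return path + [best]

# Hypothetical scores for illustration only.
def toy_self(afp):
    return 5.0 if afp == (9, 8) else 1.0

def toy_pair(prev, afp):
    return 2.0 + abs((afp[0] - prev[0]) - (afp[1] - prev[1]))

result = extend_path_greedily([(0, 0)], [(8, 8), (9, 8), (8, 9)],
                              toy_self, toy_pair)
print(result)  # [(0, 0), (8, 8)]
```

Calling the function repeatedly until it returns None gives one complete path; the trade-off is that a greedy choice can miss a path that would have scored better overall.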
30
Z-Score
  • Evaluate the probability of finding an alignment
    path of the same length, with the same or smaller
    gaps and distance, in a random set of
    non-redundant structures
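The general idea of a Z-score can be sketched as the number of standard deviations by which an observed value beats a background of random comparisons. This is a generic illustration, not the CE calculation, which combines gaps and distance; the function name and the sample values are assumptions.

```python
import statistics

def z_score(observed, random_scores):
    """Standard deviations by which `observed` falls below the random mean.

    Lower values are better here (e.g. a distance), so a large positive
    Z-score means the observed alignment is unusually good.
    """
    mean = statistics.mean(random_scores)
    stdev = statistics.stdev(random_scores)
    return (mean - observed) / stdev

# Hypothetical distances from random, unrelated structure pairs.
background = [4.0, 5.0, 6.0]
z = z_score(1.0, background)
print(z)
```

In practice the background would come from many comparisons within a non-redundant set of structures, not from three values.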

31
Optimization of the Final Path
The 20 best alignments with a Z-score above 3.5
are assessed based on RMSD and the best is kept.
This produces approximately one error in 1000
structures.
Each gap in this alignment is assessed for
relocation by up to m/2 positions.
Iterative optimization using dynamic programming
is performed using residues from the superimposed
structures.
32
Test Case: Phycocyanin versus Colicin A
33
Cyclin-dependent kinases: Open (purple), Closed
(blue). Pavletich et al. (1997)
34
Limitations
  • Will not find non-topological alignments (outside
    the bounds of the dotted lines)
  • What are the correct units to be comparing?
  • CE works on chains. As we shall see in future
    weeks, domains are the correct units, but the
    definition of domains is not straightforward

35
Computation of All x All
  • Took 11,748 chains in the PDB (1/98)
  • Computed for 1868 representatives
  • 24,000 Cray T3E processor hours
  • Loaded pairwise alignments into a database

36
1-2 Years Ago
  • 40,000 proteins, 70,000 chains
  • 70,000^2/2 pairs x 30 seconds per pair ≈ 2330 yrs
  • Options
  • Use a non-redundant set of chains
  • Use parallel architectures
  • D. Pekurovsky, I.N. Shindyalov, P.E. Bourne (2004)
    High Throughput Biological Data Processing on
    Massively Parallel Computers: A Case Study of
    Pairwise Structure Comparison and Alignment Using
    the Combinatorial Extension (CE) Algorithm.
    Bioinformatics, 20(12), 1940-1947.
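The estimate of roughly 2330 years can be reproduced with a few lines of arithmetic. The function name is invented for this sketch, and a 365-day year on a single processor is assumed.

```python
def all_vs_all_years(n_chains, seconds_per_pair=30.0):
    """Single-processor time, in years, for an all-against-all comparison."""
    pairs = n_chains ** 2 / 2                  # each unordered pair once
    seconds = pairs * seconds_per_pair
    return seconds / (365 * 24 * 3600)

full_set = all_vs_all_years(70_000)
representatives = all_vs_all_years(1868)
print(int(full_set))              # 2330
print(round(representatives, 2))  # 1.66
```

Because the cost grows with the square of the number of chains, reducing 70,000 chains to a representative set of a few thousand cuts the work by orders of magnitude, and the remaining pairs are independent of one another, which makes the problem well suited to parallel machines.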

37
Now
  • Using egrid to compute all by all for CE and
    FatCat

38
One Set of Criteria for Redundancy
  • Remove highly homologous chains
  • The RMSD between two chains is less than 2 Å
  • The length difference between two chains is less
    than 10%
  • The number of gap positions in alignment between
    two chains is less than 20% of aligned residue
    positions
  • At least 2/3 of the residue positions in the
    represented chain are aligned with the
    representing chain.
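The structural criteria above can be combined into a single test. This is a minimal sketch: the function and parameter names are invented, all criteria are assumed to be required together, and the length difference is assumed to be measured relative to the longer chain, which the slide does not specify.

```python
def is_redundant(rmsd, len_represented, len_representing, n_gaps, n_aligned):
    """Return True if the represented chain is redundant with the representing one."""
    longer = max(len_represented, len_representing)
    # Assumption: length difference is taken relative to the longer chain.
    length_diff = abs(len_represented - len_representing) / longer
    return (
        rmsd < 2.0                                  # RMSD below 2 Angstrom
        and length_diff < 0.10                      # lengths within 10%
        and n_gaps < 0.20 * n_aligned               # gaps below 20% of aligned positions
        and n_aligned >= (2 / 3) * len_represented  # at least 2/3 of the chain aligned
    )

print(is_redundant(1.2, 100, 105, 5, 90))   # True
print(is_redundant(2.5, 100, 105, 5, 90))   # False: RMSD too high
print(is_redundant(1.2, 100, 150, 5, 90))   # False: lengths differ too much
```

A chain flagged as redundant can be dropped from the all-against-all computation and represented by its partner.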

39
Review example where structure comparison has
revealed new biological insights
40
Example
  • CE revealed a putative Ca-binding domain in
    acetylcholinesterase
  • Sequence similarity to neuroligins predicts Ca
    binding too; confirmed experimentally
  • Members of the α/β hydrolase family bind Ca,
    which may be important for heterologous cell
    associations

Structural similarity between acetylcholinesterase
and calmodulin found using CE (Tsigelny et al,
Prot Sci, 2000, 9, 180)
41
The Future (also a general rule)
  • Gold standards are important
  • For structure comparison a human-generated
    alignment standard is important
  • Algorithms are then challenged to meet the
    standard
  • Eventually those algorithms highlight problems
    with the standard
  • The cycle continues