Protein Structure Prediction - PowerPoint PPT Presentation

Provided by: compbi
Transcript and Presenter's Notes


1
Protein Structure Prediction
2
Protein structure
  • Most proteins will fold spontaneously in water,
    so amino acid sequence alone should be enough to
    determine protein structure
  • However, the physics are daunting
  • 20,000 protein atoms, plus equal amounts of
    water
  • Many non-local interactions
  • Can take seconds (most chemical reactions take
    place 10^12, i.e. 1,000,000,000,000x, faster)
  • Empirical determinations of protein structure are
    advancing rapidly.

3
Protein review
  • Proteins are polymers of amino acids linked by
    peptide bonds.
  • Properties of proteins are determined by both the
    particular sequence of amino acids and by the
    conformation (fold) of the protein.
  • Flexibility in the bonds around Cα
  • φ (phi)
  • ψ (psi)
  • sidechain

4
Protein Structure Levels
  • Protein structure is described in four levels
  • Primary structure: amino acid sequence
  • Secondary structure: local (in sequence) ordering
    into
  • α-helices: compressed, corkscrew structures
  • β-strands: extended, nearly straight structures
  • β-sheets: paired strands, reinforced by hydrogen
    bonds
  • parallel (same direction) or antiparallel sheets
  • Coils, turns, loops: changes in direction
  • Tertiary structure: global ordering (all
    angles/atoms)
  • Quaternary structure: multiple, disconnected
    amino acid chains interacting to form a larger
    structure

5
Protein structure cartoons
6
Protein Structure Representations
  • Different visualizations show various aspects
    of structure

7
Protein Folding
  • Proteins are created linearly and then assume
    their tertiary structure by folding.
  • Exact mechanism is still unknown
  • Mechanistic simulations can be illuminating
  • Proteins assume the lowest energy structure
  • Or sometimes an ensemble of low energy
    structures.
  • Hydrophobic collapse drives process
  • Local (secondary) structure proclivities
  • Internal stabilizers
  • Hydrogen bonds, disulphide bonds, salt bridges.

8
Empirical structure determination
  • Two major experimental methods for determining
    protein structure
  • X-ray Crystallography
  • Requires growing a crystal of the protein
    (impossible for some, never easy)
  • Diffraction pattern can be inverse-Fourier
    transformed to characterize electron densities
    (Phase problem)
  • Nuclear Magnetic Resonance (NMR) imaging
  • Provides distance constraints, but can be hard to
    find a corresponding structure
  • Works only for relatively small proteins (so far)

9
X-ray crystallography
  • X-rays, since wavelength is near the distance
    between bonded carbon atoms
  • Maps electron density, not atoms directly
  • Crystal to get a lot of spatially aligned atoms
  • Have to invert Fourier transform to get
    structure, but only have amplitudes, not phases
  • Guess! or perturb...
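
The "amplitudes, not phases" point can be illustrated with a toy one-dimensional example. This is only a sketch: the two "densities" below are invented numbers, and a plain discrete Fourier transform stands in for a real diffraction experiment. A shifted copy is the simplest case of two different densities that share every amplitude.

```python
import cmath

def dft(values):
    """Plain discrete Fourier transform of a list of numbers."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * cmath.pi * h * k / n) for k, v in enumerate(values))
            for h in range(n)]

def amplitudes(values):
    """Keep only the magnitudes, which is all a diffraction pattern records."""
    return [abs(c) for c in dft(values)]

# Hypothetical 1-D "electron densities": the second is the first, shifted.
density_a = [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0]
density_b = density_a[3:] + density_a[:3]

same = all(abs(a - b) < 1e-9
           for a, b in zip(amplitudes(density_a), amplitudes(density_b)))
print(density_a != density_b, same)
```

The densities differ, yet their amplitudes agree, so inverting the transform needs phase information from somewhere else.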

10
NMR structure determination
  • NMR can detect certain features of hydrogen
    atoms
  • NOESY measures distances between non-bonded H's
    within about 5 Å
  • COSY and TOCSY describe relations through bonds
  • Combination of distance and angle constraints,
    plus knowledge of covalent bonds (amino acid
    sequence) determines a unique (sometimes)
    structure.
  • Overlapping measurements limit size to about 120 AA

11
Why predict protein structure?
  • Neither crystallography nor NMR can keep pace
    with genome sequencing efforts
  • Only 10566 (3641 with <90% identity) human
    proteins in PDB, although growing fast
  • Computer scientists love this problem
  • Understandable with minimal biology
  • Seems like a good discrimination task
  • Understand the mechanisms of folding (?)
  • First computational Nobel prize?

12
Kinds of Structure Prediction
  • Comparative modelling
  • Homolog has known structure, which is adjusted
    for sequence differences
  • Energy minimization and molecular dynamics
  • Fold recognition
  • Proteins fall into broad fold classes. Models of
    folds that recognize compatible sequences.
    Inverse problem
  • Predict more than fold class?
  • Ab initio or new fold prediction
  • No homologs, not recognized by any fold model

13
Ab Initio predictions
  • Three broad approaches
  • Molecular dynamics, energy minimization
    approaches
  • Empirical black box (induce discriminators)
  • Mechanistic (follow the actual folding path)
    approaches. Hybrid between energy and empirical
    methods.
  • Secondary structure predictions
  • Not tremendously useful nor accurate, but
    simplest.
  • Can play a role in tertiary predictors
  • Tertiary structure predictions
  • Best involve a complex mixture of approaches

14
Energy Minimization
  • Many forces act on a protein
  • Hydrophobic: inside of protein wants to avoid
    water
  • Packing: atoms can't be too close, nor too far
    away
  • Bond angle/length constraints
  • Long distance, e.g.
  • Electrostatics, hydrogen bonds
  • Disulphide bonds
  • Salt bridges
  • Can calculate all of these forces, and minimize
  • Intractable in general case, but can be useful
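
A minimal sketch of "calculate the forces and minimize", using a single packing term only. The Lennard-Jones form, the unit parameters, and the fixed-step steepest descent are assumptions chosen for illustration. A real force field sums many such terms over thousands of atoms.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy: steep repulsion up close, weak attraction further out."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative dE/dr of the pair energy above."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * s6 * s6 + 6.0 * s6) / r

def minimize(r, step=0.01, iterations=2000):
    """Fixed-step steepest descent on the separation r."""
    for _ in range(iterations):
        r -= step * lj_gradient(r)
    return r

r_min = minimize(1.5)  # start with the two atoms too far apart
print(round(r_min, 4), round(lj_energy(r_min), 4))
```

With one variable the descent settles at the separation where repulsion and attraction balance. The intractability mentioned above comes from doing this over every atom at once, on a surface full of local minima.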

15
Empirical models
  • Pose structure prediction as induction task.
  • What are the inputs and outputs?
  • Where do we get enough training data?
  • Which induction methods work best?
  • Long history in bioinformatics

16
Initial approaches to secondary structure prediction
  • Input is a "sliding window" of immediately
    surrounding sequence assumed to determine
    structure (no long distance interactions)
    ...mnnstnssnsgla...
    H
  • Output is one of three possible secondary
    structure states: helix, strand, other
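
The sliding-window input can be sketched as follows. The window half-width and the "-" padding character are assumptions; the slide does not give a window size.

```python
def sliding_windows(sequence, half_width=6, pad="-"):
    """Yield (index, window) pairs, one window centred on each residue."""
    padded = pad * half_width + sequence + pad * half_width
    for i in range(len(sequence)):
        yield i, padded[i:i + 2 * half_width + 1]

seq = "mnnstnssnsgla"
windows = list(sliding_windows(seq, half_width=2))
for i, window in windows[:3]:
    print(i, seq[i], window)
```

Each window would be paired with the secondary structure state of its central residue to make one training example.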

17
Why might this work?
  • There are local propensities to secondary
    structural classes (largely hydropathy)
  • Helices: no prolines, sometimes amphipathic (show
    alternating hydropathy with period 3.6 residues)
  • Strands: either alternating hydropathy or ends
    hydrophilic and center hydrophobic
  • Neither: small, polar, flexible residues.
    Prolines.
  • Minimum lengths for secondary structures (helices
    longer than strands)
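
One way to test for the 3.6-residue periodicity is a hydrophobic moment, sketched below. This is a swapped-in technique, not something the slides specify: it assumes the Kyte-Doolittle hydropathy scale and a turn of 100 degrees per residue (360 / 3.6). The toy sequence is constructed to be amphipathic.

```python
import math

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(sequence, degrees_per_residue=100.0):
    """Length of the vector sum of hydropathies placed around a helical wheel."""
    delta = math.radians(degrees_per_residue)
    x = sum(KYTE_DOOLITTLE[aa] * math.cos(i * delta) for i, aa in enumerate(sequence))
    y = sum(KYTE_DOOLITTLE[aa] * math.sin(i * delta) for i, aa in enumerate(sequence))
    return math.hypot(x, y)

# Toy sequence: leucine on one face of the wheel, lysine on the other.
toy_helix = "".join("L" if math.cos(math.radians(100 * i)) > 0 else "K"
                    for i in range(18))
print(toy_helix)
print(hydrophobic_moment(toy_helix, 100.0) > hydrophobic_moment(toy_helix, 180.0))
```

The same function with 180 degrees per residue looks for the strict alternation expected in some strands.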

18
Early methods
  • Chou-Fasman method looked at frequency of each
    amino acid in window
  • GOR defined an information measure I(S;R) =
    log[P(S|R) / P(S)], where S is secondary structure
    and R is amino acid. Define information gain as
    I(S;R) - I(not-S;R) and predict state with
    highest gain.
  • How to combine info gain for each element of
    sliding window? Independently (just add) or by
    pairs
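
A single-residue sketch of a GOR-style information measure, assuming it has the form I(S;R) = log[P(S|R) / P(S)] and that the gain compares state S against "any other state". The counts are made up for illustration and include no zero cells, so no pseudocounts are needed here.

```python
import math

def info(pairs, residue, in_state):
    """log[P(S|R) / P(S)], with S given as a predicate over state labels."""
    with_residue = [s for r, s in pairs if r == residue]
    p_s_given_r = sum(in_state(s) for s in with_residue) / len(with_residue)
    p_s = sum(in_state(s) for _, s in pairs) / len(pairs)
    return math.log(p_s_given_r / p_s)

def gain(pairs, residue, state):
    """Information for `state` minus information for 'not state'."""
    return (info(pairs, residue, lambda s: s == state)
            - info(pairs, residue, lambda s: s != state))

def predict(pairs, residue, states="HEC"):
    """Pick the state with the highest information gain."""
    return max(states, key=lambda st: gain(pairs, residue, st))

# Hypothetical (residue, state) observations: H helix, E strand, C other.
toy = ([("A", "H")] * 6 + [("A", "E")] * 1 + [("A", "C")] * 3
       + [("V", "H")] * 2 + [("V", "E")] * 6 + [("V", "C")] * 2
       + [("G", "H")] * 1 + [("G", "E")] * 2 + [("G", "C")] * 7)
print({r: predict(toy, r) for r in "AVG"})
```

Combining window positions independently would mean summing the gain over every residue in the window before taking the maximum.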

19
How well did they work?
  • Not very well: roughly 50-55% accurate on a
    residue-by-residue basis.
  • Random prediction that obeyed the observed
    distribution of helix/strand/other would be 40%
    accurate
  • Different ways to calculate "correctness"
  • Needs to be unbiased (especially wrt homology)!
  • Getting number of helices and strands or order
    right is harder than just counting residue by
    residue (like the difference between nucleotide
    and exon level gene finding).
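
Residue-by-residue accuracy and the random baseline can be computed as below. The example strings are invented. The baseline assumes predictions are drawn independently from the observed state frequencies, which gives the sum of squared frequencies.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

def random_baseline(observed):
    """Expected accuracy of guessing each state at its observed frequency."""
    n = len(observed)
    return sum((observed.count(s) / n) ** 2 for s in set(observed))

observed = "HHHEECCCCC"   # hypothetical: 30% helix, 20% strand, 50% other
predicted = "HHCEECCCCH"
print(q3(predicted, observed), round(random_baseline(observed), 2))
```

A per-residue score like this says nothing about whether the number or order of helices and strands is right, which is the harder criterion mentioned above.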

20
Fancier induction techniques
  • Same setup as Chou-Fasman or GOR
  • Sliding window across amino acid sequence as
    input
  • Three class output (helix/sheet/other)
  • Various different induction techniques over same
    data, give modest improvements
  • LDA/QDA
  • Decision trees
  • Neural networks
  • Best results from neural networks (about 62%)

21
Add multiple sequence alignment information
  • This is helpful in principle
  • insertions/deletions more likely to be coil/turn
  • conserved hydropathy more important for
    prediction than non-conserved.
  • GOR method improves 8-9 points (to about 64%
    correct residue by residue).
  • Similar improvement for NNs (to 68%)
  • SVMs gain a bit more, to about 70%

22
But the information isn't there
  • Prediction quality has not improved much even
    with huge growth of training data.
  • Secondary structure is not completely determined
    by local forces
  • Long distance interactions do not appear in
    sliding window
  • Empirical studies show same amino acid sequences
    can assume multiple secondary structures.

23
Mechanistic models
  • Move from purely empirical to include some
    knowledge of folding mechanisms
  • Compact nature of conformations
  • Hydrophobic packing
  • Sequences of secondary structures
  • Secondary structure predispositions
  • Heuristic global energy minimization

24
Hydrophobic packing models
  • Dill's HP model
  • Two classes of amino acids, hydrophobic (H) and
    polar (P)
  • Lattice model for position of (point) amino
    acids.
  • Thread chain of H's and P's through lattice to
    maximize number of H-H contacts
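
A brute-force sketch of the 2D square-lattice version, practical only for very short chains. The move encoding (U/D/L/R) and function names are illustrative choices, not part of the HP model itself.

```python
from itertools import product

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold(moves):
    """Turn a move string into lattice coordinates, or None if the chain collides."""
    coords = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        nxt = (coords[-1][0] + dx, coords[-1][1] + dy)
        if nxt in coords:
            return None
        coords.append(nxt)
    return coords

def hh_contacts(hp, coords):
    """Count H-H pairs that are lattice neighbours but not chain neighbours."""
    count = 0
    for i in range(len(hp)):
        for j in range(i + 2, len(hp)):
            if hp[i] == hp[j] == "H":
                dist = abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1])
                if dist == 1:
                    count += 1
    return count

def best_fold(hp):
    """Try every self-avoiding walk and keep the one with the most H-H contacts."""
    best = (-1, None)
    for moves in product("UDLR", repeat=len(hp) - 1):
        coords = fold(moves)
        if coords is not None:
            score = hh_contacts(hp, coords)
            if score > best[0]:
                best = (score, "".join(moves))
    return best

print(best_fold("HPPH"))
```

The search visits 4^(n-1) move strings for a chain of length n, so exhaustive enumeration stops being usable after a few dozen residues.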

[Figure: 3D and 2D lattice examples]
25
But...
  • Even the 2D HP packing problem (which is easier
    than the 3D one) turns out to be NP complete!
  • Good approximation results exist.
  • Approximation within 3/8 of optimal (3D)
  • In triangular lattice, algorithm for >60% of
    optimal packing
  • Other interesting results in the model, e.g.
  • Which sequences have a single optimal fold?

26
CASP changed the landscape
  • Critical Assessment of Structure Prediction
    competition. Even numbered years since 1994
  • Solved, but unpublished structures are posted in
    May, predictions due in September, evaluations in
    December
  • Various categories
  • Relation to existing structures, ab initio,
    homology, fold, etc.
  • Partial vs. Fully automated approaches
  • Produces lots of information about what aspects
    of the problems are hard, and ends arguments
    about test sets.
  • Results showing steady improvement, and the value
    of integrative approaches.

27
CASP 6 Categories
  • Human intervention versus fully automated
    predictions
  • Comparative modeling
  • A structure exists for a good homolog
  • Looking for mutations, bond rotations, etc.
  • Fold recognition (Homologs)
  • Distant homolog recognition and adaptation
  • Looking at loop placement, domain boundaries
  • Fold recognition (Analogous)
  • No homolog, but similar structures in DB
  • Finding the right model structure
  • Ab Initio
  • No similar structures in DB. Most fundamental
    problem.
  • Other issues
  • Domain boundaries, disordered regions,
    residue-residue contacts

28
CASP Results
  • Fully automated methods now nearly as good as
    ones with human intervention
  • Consensus methods (looking for agreement among
    servers) do best overall, but not by much and not
    all the time.
  • Consistent best approach is Rosetta from David
    Baker's lab

29
CASP performance improving
30
Baker: best strategy so far
  • Two step process
  • Generate a good sized collection of plausible
    structures and near-miss bad structures
  • Requires a good energy function, good
    optimization approach
  • Quality of decoy (incorrect, but plausible
    folds) is important
  • Build discriminators to separate correct from
    decoy structures.
  • Rosetta (Baker lab) and fully automated Robetta.
  • Ran away with CASP4, still the best at CASP5 and CASP6
  • Robetta almost as good as Rosetta
  • Outstanding at ab initio, competitive at the
    rest.

31
Rosetta approach
  • Integrated method
  • I-Sites: much finer grained substructures than
    secondary structures. A library of all
    consistent structures of short polypeptides is
    defined (taken from PDB)
  • Build initial models by assigning I-sites to new
    amino acid sequence (many possibilities)
  • Monte Carlo search through assignments of I-Sites
    to minimize energy function.
  • Use of sophisticated global energy function
  • Take good scoring structures, and test them on a
    decoy detector, which looks for high scoring
    but non-native structure patterns.

32
I-Sites
  • I-sites are a set of sequence patterns that
    strongly correlate with protein structure at the
    local level.
  • Ungapped amino acid sequence motifs
  • Length 3-9 (now longer)
  • Originally 82 classes (now more)
  • Defined by amino acid log odds matrix and phi/psi
    angles
  • Far more detailed than the 3 state
    helix/sheet/other local structure models

33
Example I-Site
  • Proline containing alpha helix C-cap

[Figure: example I-site, showing φ/ψ angles, cartoon, AA log odds by motif position, and member structures]
34
How I-sites are defined
  • Starting from all sequences in PDB at the time
  • Remove sequences with 25% or greater sequence
    identity to compensate for oversampling of
    certain families
  • Cluster all possible subsequences of these
    structures of length 3-15.
  • For each cluster, define paradigm structure.
  • Remove members that are too far away structurally
  • Add new members that are structurally similar
  • If can't distinguish well (bimodal scores)
    between members and non-members, drop the cluster

35
I-sites are not unique
  • One amino acid subsequence may be compatible with
    several I-sites
  • I-sites are not defined to be mutually exclusive
    over sequence.
  • Slightly different starting positions or lengths
    may yield quite different (even incompatible)
    I-sites for the same sequence region.
  • This has biological relevance
  • Local predispositions are not determinative or
    unique
  • Multiple predispositions are more informative
    than none.

36
I-sites pro and con
  • Lots of ad hoc fiddling to get I-site library
  • Distance measure on sequence has two free
    parameters
  • Many different structure distance measures tried
  • K-means clustering (K is free parameter)
  • Test for bimodal scoring (two more parameters)
  • Occasional subdivision of an I-Site that seemed
    to have two good structures associated with it
  • Corresponds reasonably well to existing
    crystallographic concepts (e.g. Type II β turns)
  • They are more predictable than H/S/C

37
HMMSTR
  • I-sites often overlap (sequences of sites
    correspond to traditional local structures)
  • Basic idea: Hidden Markov Model for sequences of
    I-sites
  • No in/dels. States specify distributions of
  • Amino acids
  • secondary structures
  • φ/ψ angles (discretized)
  • structural context

38
Simple HMMSTR
  • Simple example for well known structural motif
  • Combination of two I-sites which overlap
  • States defined by positions in an I-site
  • Alternative paths for different I-sites

39
  • Whole HMMSTR model
  • Each node has start probability
  • Specifies transitions between types of local
    structure as well as within them

40
Training of HMMSTR
  • Many ad hoc approaches based on biological
    intuitions
  • When to merge overlapping states?
  • Dynamic programming to find likely transitions
    between I-sites
  • Null transition state to connect otherwise
    disconnected subtrees.
  • Model surgery: adding, splitting and deleting
    states.
  • Structure predictions by voting rather than
    most probable parse.
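
"Voting rather than most probable parse" can be read as per-position posterior decoding, sketched here with the forward-backward algorithm on a two-state toy model. The states, probabilities, and the h/p observation alphabet are invented. HMMSTR itself has many more states and emits several kinds of output.

```python
def posteriors(obs, states, start, trans, emit):
    """Forward-backward: P(state at position t | whole observation sequence)."""
    n = len(obs)
    fwd = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, n):
        fwd.append({s: emit[s][obs[t]] * sum(fwd[t - 1][p] * trans[p][s] for p in states)
                    for s in states})
    bwd = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = {s: sum(trans[s][q] * emit[q][obs[t + 1]] * bwd[t + 1][q] for q in states)
                  for s in states}
    total = sum(fwd[-1][s] for s in states)
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states} for t in range(n)]

# Hypothetical toy model: observations are h (hydrophobic) or p (polar).
states = ("helix", "coil")
start = {"helix": 0.5, "coil": 0.5}
trans = {"helix": {"helix": 0.9, "coil": 0.1}, "coil": {"helix": 0.1, "coil": 0.9}}
emit = {"helix": {"h": 0.8, "p": 0.2}, "coil": {"h": 0.2, "p": 0.8}}

post = posteriors("hhhhpppp", states, start, trans, emit)
labels = [max(p, key=p.get) for p in post]
print(labels)
```

Each position takes the state with the most posterior mass summed over all paths, so many weak alternative paths that agree locally can outvote the single best path.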

41
Beating HMMSTR
  • OK, but not great results in predictive accuracy.
  • Too many alternative paths through the model, and
    difficulty choosing between them on the basis of
    sequence alone.
  • Only local information; no global measures used.
  • Rosetta adds global information to I-site
    assignments and gets a big improvement

42
Rosetta prediction method
  • Define global scoring function that estimates
    probability of a structure given a sequence
  • Generate version of I-sites with fixed length
    subsequences (9 amino acids)
  • Calculate P(I-Site | sequence) for all sequences
    and I-sites
  • Generate structures by Monte Carlo sampling of
    assignments of fixed size I-sites to subsequences
  • End up with ensemble of plausible structures
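
The sampling step can be sketched as Metropolis Monte Carlo over fragment insertions. Everything concrete here is a stand-in: the structure is a string of per-residue state labels instead of torsion angles, the fragment library offers three invented fragments at every position, and the toy energy merely counts state changes. None of it is Rosetta's actual library or scoring function.

```python
import math
import random

def toy_energy(structure):
    """Illustrative score only: number of state changes between neighbours."""
    return sum(a != b for a, b in zip(structure, structure[1:]))

def monte_carlo_assembly(start, fragments, energy, steps=2000, temperature=1.0, seed=0):
    """Insert random fragments, accepting moves by the Metropolis criterion."""
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    best, e_best = current[:], e_current
    positions = sorted(fragments)
    for _ in range(steps):
        pos = rng.choice(positions)
        frag = rng.choice(fragments[pos])
        trial = current[:]
        trial[pos:pos + len(frag)] = frag
        e_trial = energy(trial)
        accept = (e_trial <= e_current
                  or rng.random() < math.exp((e_current - e_trial) / temperature))
        if accept:
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current[:], e_current
    return "".join(best), e_best

start = "HELHELHELHEL"
# Hypothetical library: the same three 3-residue fragments at every position.
library = {i: ["HHH", "EEE", "LLL"] for i in range(len(start) - 2)}
structure, score = monte_carlo_assembly(start, library, toy_energy)
print(structure, score)
```

Repeating the run with different seeds would give the ensemble of plausible structures, which a separate scoring or decoy-detection step then ranks.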

43
Rosetta Scoring Function
  • Global scoring function issues
  • Distinguish native-like structures from not.
    Generation methods unlikely to produce exact
    native structure.
  • Decoy testing. Create many structures that are
    plausible and not too far from native fold, and
    try to distinguish these
  • Bayesian approach
  • Sequence dependent and sequence independent
    evaluation of predicted structure.

44
Score Decomposition
45
Good Performance
  • An ab initio target
  • Red: correct, grey: incorrect
  • Missed a sheet
  • Good overall topology

46
And bad
  • Hardest structure for all prediction methods
  • Central sheet region contains loops and two small
    helices
  • Single hydrogen bond extends and alters two
    substructures