Title: Protein families and structure prediction
1 Protein families and structure prediction
- Classification of proteins by sequence similarity
- Prediction of 2D and 3D structure from amino acid sequence
2 Protein classification and structure prediction
- Protein classification schemes include the following:
- family (>50% identity) and superfamily (significant identity but <<50%), based on sequence alignments
- domain classification (PFAM, InterPro), based on local alignments of domains
- global clustering analysis, based on degree of similarity or significance of alignment score following pair-wise alignment
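3 Maximal linkage clustering
Sequence similarity (%)
seq   1    2    3    4    5    6    7
1     -   70   60   23   28   26   17
2          -   75   20   30   24   20
3               -   24   25   22   15
4                    -   65   53   13
5                         -   60   12
- Find the largest cluster that maximizes the average score.
- For cluster {1, 2, 3}, the average is (70 + 60 + 75)/3 = 68.
- For {1, 2, 3, 4}, the average is (70 + 60 + 75 + 23 + 20 + 24)/6 = 45, so do not add 4 to {1, 2, 3}.
- Sequence 7 is an orphan.
[Figure: cluster diagram with nodes 1, 2, 3 grouped together, nodes 5, 6, 4 grouped together, and node 7 alone. A strong link indicates high sequence similarity or a very low E value.]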
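The averaging step above can be checked with a short Python sketch. It only reproduces the arithmetic on the slide (mean similarity over all pairs in a candidate cluster); the function name `average_similarity` and the dictionary layout are illustrative assumptions, and only the pairs listed on the slide are included.

```python
from itertools import combinations

# Pairwise percent similarities copied from the values above.
SIMILARITY = {
    (1, 2): 70, (1, 3): 60, (1, 4): 23, (1, 5): 28, (1, 6): 26, (1, 7): 17,
    (2, 3): 75, (2, 4): 20, (2, 5): 30, (2, 6): 24, (2, 7): 20,
    (3, 4): 24, (3, 5): 25, (3, 6): 22, (3, 7): 15,
    (4, 5): 65, (4, 6): 53, (4, 7): 13,
    (5, 6): 60, (5, 7): 12,
}

def average_similarity(cluster):
    """Mean similarity over all pairs of sequences in the cluster."""
    pairs = list(combinations(sorted(cluster), 2))
    return sum(SIMILARITY[p] for p in pairs) / len(pairs)

print(round(average_similarity({1, 2, 3}), 1))     # 68.3
print(round(average_similarity({1, 2, 3, 4}), 1))  # 45.3
```

Adding sequence 4 drops the average from about 68 to about 45, which is why 4 is left out of the first cluster.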
4 Flow chart for structure prediction
[Figure: flow chart. Boxes and decision points, connected by yes/no branches:]
- Protein sequence
- Database similarity search
- Protein family, domain, cluster analysis
- Does sequence align with protein of known 3D structure?
- 3D comparative modeling
- Predicted three dimensional structure
- Relationship to known structure?
- Structural analysis
- Is there a predicted structure?
- 3D analysis in laboratory
5 Secondary structure prediction
- A sliding window of 13-17 residues moves along the sequence and tries to predict the secondary structure of the amino acid in the middle of the window
METHODS
- score types of amino acids in window: Chou-Fasman and GOR methods
- neural networks
- nearest neighbor methods
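As a minimal sketch of the sliding window idea, the following Python generator yields each full window together with the index of its middle residue. The function name, the default size of 13, and the example sequence are assumptions for illustration; residues too close to the ends to sit in the middle of a full window are simply skipped here.

```python
def sliding_windows(sequence, size=13):
    """Yield (center_index, window) for every full window in the sequence."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so it has a middle residue")
    half = size // 2
    for center in range(half, len(sequence) - half):
        yield center, sequence[center - half:center + half + 1]

seq = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
for center, window in sliding_windows(seq, size=13):
    # seq[center] is the residue whose structure would be predicted
    print(center, seq[center], window)
```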
6 Chou-Fasman and GOR
- The secondary structure of the middle amino acid in a sequence window is scored, using known structures
- A scoring system is made for each amino acid in each type of structure (helix, strand, or loop): score (V in helices) = frequency of V in windows within helices / frequency of all amino acids in helices
- Rules are used to decide what is predicted for a particular segment; a series of residues with the same score is needed
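The scoring ratio above can be sketched in Python as a propensity: the fraction of a given amino acid found in a structure type, divided by the fraction of all residues found in that type. This is one reading of the slide's formula, in the Chou-Fasman style; the function name and the tiny training set are hypothetical, and real propensities are derived from large sets of known structures.

```python
from collections import Counter

def propensities(sequences, structures, state="h"):
    """Propensity of each amino acid for `state`:
    (fraction of that amino acid in state) / (fraction of all residues in state)."""
    total = Counter()
    in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    overall = sum(in_state.values()) / sum(total.values())
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}

# Toy training data (hypothetical): h = helix, s = strand, l = loop
seqs = ["AVAVGG", "VVAGGA"]
structs = ["hhhhll", "hhslll"]
helix = propensities(seqs, structs, "h")
print(helix)  # values above 1 favor helix, values below 1 disfavor it
```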
7 Neural networks, e.g. PHD
Used for general secondary structure prediction and for prediction of protein class, e.g. membrane proteins
- the input layer is the sliding window plus any homologous sequences
- the network is trained on known sequences by adjusting weights
- the hidden layer can detect correlations within the sequence window
- the output layer is fed into another network that keeps track of sequential predictions
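To make the data flow concrete, here is a small Python sketch of a window passing through an input layer, one hidden layer, and a three-way output (helix, strand, loop). It is not PHD: the weights are random and untrained, the one-hot encoding ignores homologous sequences, and the layer sizes and helper names are all assumptions chosen for brevity.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_window(window):
    """Encode a window as a flat list: 20 inputs per residue, one of them set to 1."""
    vec = []
    for aa in window:
        vec.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return vec

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return activation([sum(w * x for w, x in zip(row, inputs)) + b
                       for row, b in zip(weights, biases)])

def random_layer(n_out, n_in, rng):
    """Untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
window = "AKQRQISFVKSHF"                 # hypothetical 13-residue window
x = one_hot_window(window)               # input layer: 13 * 20 values
w1, b1 = random_layer(8, len(x), rng)    # hidden layer (size 8 is arbitrary)
w2, b2 = random_layer(3, 8, rng)         # output layer: helix, strand, loop
hidden = dense(x, w1, b1, sigmoid)
probs = dense(hidden, w2, b2, softmax)
print(dict(zip(["helix", "strand", "loop"], probs)))
```

Training would adjust `w1`, `b1`, `w2`, and `b2` so that the output for windows from known structures matches the observed structure of the middle residue.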
8 Nearest neighbor methods
- Make a table of sequence windows from known structures, noting the structure of the middle amino acid in each window
- Find the 50 best alignments of a window in the test sequence with this table
- Score the frequencies of helix, strand, and loop in the middle amino acid position
- Scan the sequence for a series of high scoring predictions
[Figure: column of the 50 matching windows with the structure of each middle residue, for example h, h, h, l, h, s, h, h.]
Looks like the middle amino acid should be in a helix
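The table lookup and vote described above can be sketched in Python as follows. Counting identical positions stands in for a real alignment score, and the table entries, function names, and 5-residue window size are hypothetical; the final line tallies the eight example labels shown on the slide.

```python
from collections import Counter

def window_score(a, b):
    """Number of identical positions between two equal-length windows
    (a stand-in for a real alignment score)."""
    return sum(x == y for x, y in zip(a, b))

def vote(structures):
    """Frequencies of helix (h), strand (s), and loop (l) among the matches."""
    counts = Counter(structures)
    total = sum(counts.values())
    return {s: counts[s] / total for s in "hsl"}

def predict_middle(test_window, table, k=50):
    """Vote among the k best-matching windows.
    `table` holds (window, middle_structure) pairs from known structures."""
    best = sorted(table, key=lambda entry: window_score(test_window, entry[0]),
                  reverse=True)[:k]
    return vote(structure for _, structure in best)

# Hypothetical table of 5-residue windows
table = [("AEELL", "h"), ("AEELK", "h"), ("VTVTV", "s"),
         ("GPGSG", "l"), ("AEKLL", "h")]
freqs = predict_middle("AEELL", table, k=3)
print(freqs)

print(vote("hhhlhshh"))  # the eight labels from the slide: helix wins with 6 of 8
```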
9 Zinc finger
One of the most commonly predicted structures because of the conserved pattern of Cs and Hs.
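A conserved pattern of Cs and Hs lends itself to a simple pattern search. The Python sketch below uses a regular expression for a C2H2-style spacing (two cysteines, then two histidines); the exact spacings are an illustrative assumption modeled on motif-database patterns, and the test sequence is made up.

```python
import re

# Assumed C2H2-style pattern: C, 2-4 residues, C, 3 residues, one hydrophobic
# residue, 8 residues, H, 3-5 residues, H.
ZINC_FINGER = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

def find_zinc_fingers(sequence):
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in ZINC_FINGER.finditer(sequence)]

hits = find_zinc_fingers("YKCPECGKSFSRSDELTRHIRIHTG")  # hypothetical sequence
print(hits)  # [(2, 23, 'CPECGKSFSRSDELTRHIRIH')]
```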
10 A leucine zipper is caused by a repeat of leucines at every second turn (about every seventh residue) of two antiparallel alpha helices. The helices bond and can attach to DNA, cause protein-protein interactions, or form a coiled coil.
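Because an alpha helix has roughly 3.6 residues per turn, a leucine at every second turn shows up as a leucine at about every seventh position. A minimal Python sketch of that check follows; the function name, the requirement of four consecutive repeats, and the test sequence are assumptions for illustration.

```python
def leucine_heptad_starts(sequence, repeats=4):
    """Return 0-based positions where a leucine recurs every 7th residue
    for `repeats` consecutive heptads."""
    span = 7 * (repeats - 1)
    return [i for i in range(len(sequence) - span)
            if all(sequence[i + 7 * k] == "L" for k in range(repeats))]

seq = "MK" + "LAAAAAA" * 3 + "L"  # hypothetical sequence with leucines at 2, 9, 16, 23
print(leucine_heptad_starts(seq))  # [2]
```

A hit from a scan like this is only a candidate: the leucine spacing alone does not show that the segment is helical.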
11 Three dimensional prediction
- Hidden Markov models for a few families and classes
- Threading using structural profiles
- Threading by the contact potential method
- New method: the I chain method
12 Hidden Markov models
[Figure: part of a hidden Markov model with these labeled elements: start of alpha unit; series of match states for an alpha helix; end of alpha unit; junction; variable series of match states for a loop; variable series of match states for a beta turn; transition probabilities P1, P2, P3 and their complements 1 - P1, 1 - P2, 1 - P3.]
This diagram illustrates how to match a sequence to a model of a set of proteins that have both sequence and structural similarities. Only part of the model is shown.
13 Threading a sequence through structural core models
First, prepare a series of core models represented by scoring matrices, HMMs, or some other model that represents the whole core sequence.
- Thread the test sequence through the cores to find a good match
- This is achieved by aligning the sequence with the models and screening for a high score or probability
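The screening step can be sketched in Python with each core reduced to a position-specific scoring matrix and the sequence slid along it without gaps. This is a simplification under stated assumptions: real threading allows gaps and uses richer models, and the core names, scores, and default mismatch penalty of -1.0 are hypothetical.

```python
def best_ungapped_score(sequence, core):
    """Slide the sequence along a core model and return (best_score, offset).
    `core` is a list of dicts, one per core position, mapping amino acid -> score."""
    best = None
    for offset in range(len(sequence) - len(core) + 1):
        score = sum(col.get(sequence[offset + i], -1.0)  # -1.0: assumed penalty
                    for i, col in enumerate(core))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

def thread(sequence, cores):
    """Score the sequence against every core; return results ranked best first."""
    results = []
    for name, core in cores.items():
        hit = best_ungapped_score(sequence, core)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    return sorted(results, reverse=True)

# Hypothetical core models, four positions each
cores = {
    "helical_core": [{"A": 2.0, "L": 2.0, "E": 1.5}] * 4,
    "strand_core": [{"V": 2.0, "I": 2.0, "T": 1.0}] * 4,
}
ranking = thread("GGALEAGG", cores)
print(ranking)  # [(7.5, 'helical_core', 2), (-4.0, 'strand_core', 0)]
```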
14 Structural or 3D profile method
- Determine structural parameters for each amino acid in a core
- The parameters include neighbor geometry and closeness, chemical environment, hydrophobicity, secondary structures of nearby amino acids, etc.
- Based on this analysis, each amino acid in the core is assigned to one of 18 environmental types
- The ability of each of the other 19 amino acids to fit into this environment is determined
- A scoring matrix with gap penalties (a profile) is then made for each core based on the above
- A test sequence is aligned with the profiles by dynamic programming
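The final alignment step can be illustrated with a small global dynamic programming routine in Python. It is a sketch under simplifying assumptions: the profile columns below are made-up scores rather than scores derived from the 18 environment types, a single linear gap penalty replaces position-specific gap penalties, and only the score is returned, not the alignment itself.

```python
def align_to_profile(sequence, profile, gap=-2.0, default=-1.0):
    """Global alignment score of a sequence against a profile.
    `profile` is a list of dicts (one per core position) mapping amino acid -> score."""
    n, m = len(sequence), len(profile)
    # dp[i][j] = best score aligning the first i residues to the first j positions
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + profile[j - 1].get(sequence[i - 1], default)
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

# Hypothetical three-position profile
profile = [{"A": 3.0}, {"L": 3.0}, {"K": 3.0}]
print(align_to_profile("ALK", profile))  # 9.0
print(align_to_profile("AK", profile))   # 4.0 (one gap opposite the L position)
```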
15 Contact potential method
- The method is like the distance matrix method for aligning structures
- The new sequence is superimposed on the 2D representation of each structure
- The object is to fit the amino acids so that the distances between adjacent ones are suitable for van der Waals contacts
- This is a form of energy minimization, which is itself still undergoing development
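As a rough illustration of scoring a sequence against a structure's contacts, the Python sketch below sums a toy pair energy over positions that touch in a known structure. Everything specific here is assumed: the hydrophobic grouping, the -1.0 energy, and the contact list are placeholders, and real contact potentials are derived from statistics over known structures.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed grouping, for illustration only

def pair_energy(a, b):
    """Toy contact energy: favorable (-1.0) when both residues are hydrophobic."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0

def threading_energy(sequence, contacts):
    """Sum pair energies over structure positions that are in contact.
    `contacts` lists (i, j) position pairs taken from a structure's contact map."""
    return sum(pair_energy(sequence[i], sequence[j]) for i, j in contacts)

contacts = [(0, 4), (1, 5), (2, 6)]           # hypothetical contact map
print(threading_energy("LVIGGAF", contacts))  # -2.0
```

Lower totals indicate a better fit, so candidate structures could be ranked by this energy.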
16 The Rosetta Method Using I chains
David Baker Laboratory, http://depts.washington.edu/bmsdwp/
- Structural similarity is conserved more strongly than sequence similarity
- Search the same 3D structural folds for distant sequence similarities
- These are found, and are short patterns of 3-15 amino acids called I chains
17 The Rosetta Method
- "Rosetta is based on a picture of protein folding in which local sequence segments rapidly alternate between different possible local structures, and folding occurs when the conformations and relative orientations of these local segments combine to form low energy global structures." (D. Baker)
18 Rosetta Method
- The I chains form a series of structures close to the optimal, lowest energy one
- For structure prediction, the object is to scan through a sequence for matches to I chains and then generate a best local match
- Sequences are searched for matches in a window of 9 amino acids to see if they are represented
- A compatible 3D model is then produced
- Rosetta was the best predictor in the CASP4 meeting
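The window search step can be pictured with a short Python sketch that looks up, for each 9-residue window, the closest entry in a small fragment library. This is not Rosetta's algorithm: the library contents, the identity-count score, and the function name are hypothetical, and the assembly of fragments into a 3D model is not shown.

```python
def best_fragments(sequence, library, size=9):
    """For each window of `size` residues, return the library fragment with the
    most identical positions. `library` maps fragment sequence -> local structure."""
    picks = []
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        fragment = max(library,
                       key=lambda frag: sum(a == b for a, b in zip(window, frag)))
        picks.append((start, window, fragment, library[fragment]))
    return picks

# Hypothetical two-entry fragment library
library = {"AEELLKKAE": "helix", "VTVKVTVEV": "strand"}
picks = best_fragments("GAEELLKKAEG", library)
for pick in picks:
    print(pick)
```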
19 Summary of structure prediction
- The most important predictor is sequence similarity, even very distant sequence similarity
- Some structures are readily predictable, e.g. membrane spanning helices, whereas others are much more difficult to predict
- Alpha helices are quite accurately predicted
- All methods suffer from a memory of the training set: the models are overtrained
- A truly blind experiment, the CASP meetings, has revealed that none of the methods works particularly well on some new structures; Rosetta is best so far