Title: Bioinformatics master course DNA/Protein structure-function analysis and prediction Lecture 5: Protein Fold Families
1Bioinformatics master course DNA/Protein
structure-function analysis and prediction
Lecture 5: Protein Fold Families
- Centre for Integrative Bioinformatics VU (IBIVU)
- Faculty of Sciences / Faculty of Earth and Life Sciences
2Protein structure evolution
Insertion/deletion of secondary structural
elements can easily be done at loop sites
3Protein structure evolution
Insertion/deletion of structural domains can
easily be done at loop sites
4Fold classification
- four broad structural protein fold classes
- all-α
- all-β
- α/β (α mixed with β)
- α+β (separated α and β regions)
5The first protein structure in 1960: Myoglobin - α fold
6There are a number of examples of small proteins
(or peptides) which consist of little more than a
single helix. A striking example is alamethicin,
a peptide antibiotic that forms a voltage-gated
transmembrane ion channel.
7Coiled-coil domains
This long protein is involved in muscle
contraction
Tropomyosin
8Alpha-helix interaction
The interface areas of two packed helices should have
complementary surfaces. The surface of an α-helix can be
thought of as consisting of grooves and ridges,
like a screw thread: for instance, the side
chains of every 4th residue form an "i+4" ridge
(because there are 3.6 residues per turn). The
direction of this ridge is 26° from the direction
of the helix axis. Therefore, if two helices pack
such that such a ridge from each fits into the
other's groove, the expected angle between the
two is 52°. In fact, in the observed distribution
of this angle between packed α-helices, there
is a sharp peak at 50°. Ridges can also be
formed by other stacking patterns of residues,
such as every 3rd residue, or indeed every
residue. The "i+4" ridge is believed to be the
most common because residues at every 4th
position have side chains which are more closely
aligned than in "i+3" or "i+1" ridges, as
indicated below.
http://swissmodel.expasy.org/course/text/chapter4.htm
9Helix-turn-helix and 4-helix bundles
Here is a diagram of Interleukin-2, human Growth
Hormone, Granulocyte-macrophage
colony-stimulating factor (GM-CSF) and
Interleukin-4.
10Beta-proteins
11Beta-sheet structures
porin
12Greek key β-strand motif
13Greek key β-strand motif
Structure gamma-crystallin
14α/β fold
Flavodoxin fold
5(βα) fold
15α/β fold
Flavodoxin family - TOPS diagrams (Flores et
al., 1994)
16Beta-alpha-beta structures
17Alpha-beta barrel
18Plait motif
19αβ 3-layer motifs (2 layers of helices with a
β-sheet in between) are often specified as
x-y-z (e.g. 4-14-5), where x is the number of
helices in the first helical layer, y is the number
of strands in the β-sheet, and z is the number of
helices in the second helical layer
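As a small illustration of this notation, the sketch below splits a three-number specification into its three layer counts. The names `ThreeLayerMotif` and `parse_three_layer` are made up for this example and do not come from any classification database.

```python
from typing import NamedTuple


class ThreeLayerMotif(NamedTuple):
    """Layer counts of a helix / sheet / helix 3-layer motif (hypothetical helper type)."""
    helices_first_layer: int
    strands_in_sheet: int
    helices_second_layer: int


def parse_three_layer(spec: str) -> ThreeLayerMotif:
    """Parse a specification such as '4-14-5' into its three counts."""
    parts = spec.strip().split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three dash-separated numbers, got {spec!r}")
    first, sheet, second = (int(p) for p in parts)
    return ThreeLayerMotif(first, sheet, second)


motif = parse_three_layer("4-14-5")
print(motif.strands_in_sheet)                                   # 14
print(motif.helices_first_layer + motif.helices_second_layer)   # 9
```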
20For ? proteins, there are no good classification
systems. You can only count
21How many folds: Chothia 1992
The first explicit estimate of the number of protein
families was made by Chothia in
1992. At that time about 120 structural families
were known. Chothia summarized the results of
several genome projects and found that the
chance that a random protein belongs to one of
the known sequence families is approximately 1/3.
According to the results of a sequence comparison
of the PDB with sequence databases (Sander &
Schneider, 1991), about 1/4 of all sequences
appeared to be similar to one of the PDB entries
at the 25% identity level. Assuming an equal
distribution of proteins among the families,
Chothia concluded that the total number of
protein structural families should be equal to
120 × 3 × 4 = 1440.
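The arithmetic behind this estimate can be written out as a short sketch. The function name and parameter names are chosen here for illustration only; the three input numbers are the ones quoted on the slide.

```python
from fractions import Fraction


def estimate_total_families(known_families, frac_in_known_seq_families, frac_similar_to_pdb):
    """Scale the number of known structural families by the fraction of
    proteins they cover, assuming proteins are spread equally over families."""
    coverage = frac_in_known_seq_families * frac_similar_to_pdb  # 1/3 * 1/4 = 1/12
    return known_families / coverage


total = estimate_total_families(120, Fraction(1, 3), Fraction(1, 4))
print(total)  # 1440
```

The equal-distribution assumption is the weak point of the estimate, which is what the Zipf-type distributions on the following slides address.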
22How many folds: Alexandrov & Go, 1994, updated
The Pfam-2.1 database consists of 101,724 domains of
proteins from SwissProt (Bairoch R., 1996)
release 34, clustered in 13,816 families. There
were also 7,694 proteins of 30 or more amino
acids in SwissProt-34 which are not present in
Pfam and are not similar to other proteins. We
have added them to the database, which now
contains 109,418 domains in 21,510 families. We
have eliminated very similar sequences from the
database to make it more
homogeneous. In the final classification there
were 60,601 domains, distributed among 21,510
families. All families were ranked by the number
of sequences in each family. The resulting
distribution fits Zipf's law well
(http://wwww.bionet.nsc.ru/bgrs/thesis/100/)
23How many folds
r is the rank of a family, n(r) is the number of
proteins in the r-th family, a is a scaling
constant that depends on the number of proteins in
the dataset, and b = 0.64. The constant b does not
depend on the size of the dataset.
$n(r) = a \cdot r^{-b}$
24How many folds (cont.)
Distribution of protein sequences among protein
families. One can see that the distribution is
far from uniform. The shape of the
distribution is described very well by Zipf's
law, $n(r) = a \cdot r^{-b}$, with a = 640 and b = 0.64. The
correlation coefficient of this approximation
is 0.992.
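To make the fitted law concrete, here is a minimal sketch that evaluates the power law with the quoted constants and recovers a and b from ranked family sizes by a straight-line fit in log-log space. The fitting procedure is an assumption for illustration; the slides do not state how the original fit was done, and the function names are made up.

```python
import math


def zipf_family_size(rank, a=640.0, b=0.64):
    """Expected number of proteins in the family of a given rank: a * rank**(-b)."""
    return a * rank ** (-b)


def fit_zipf(sizes):
    """Least-squares fit of log n = log a - b * log r.

    `sizes` must be ordered from the largest family (rank 1) downwards.
    Returns (a, b, correlation of log n with log r); the correlation is
    negative because sizes fall as rank rises.
    """
    xs = [math.log(r) for r in range(1, len(sizes) + 1)]
    ys = [math.log(n) for n in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope, sxy / math.sqrt(sxx * syy)


print(zipf_family_size(1))  # 640.0
# Synthetic, noise-free sizes generated from the law itself (not real Pfam data).
a, b, corr = fit_zipf([zipf_family_size(r) for r in range(1, 201)])
print(round(a), round(b, 2))  # 640 0.64
```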
25Fold number according to Alexandrov & Go
60,000 protein sequence families in 14,000
different folds
26Fold number according to Alexandrov & Go
An important feature of the Zipf distribution is
that it has a very long tail of clusters with
only a few members each. For example, if b = 0.7,
half of all proteins are located in 10% of all
clusters.
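The long-tail statement can be checked numerically. The sketch below assumes cluster sizes exactly proportional to r^(-b) and, as an arbitrary choice for this example, uses the 21,510 clusters mentioned earlier; it compares the discrete sum with the continuous approximation f^(1-b) for the share held by the top fraction f of clusters.

```python
def fraction_in_top_clusters(n_clusters, top_fraction, b):
    """Share of all proteins found in the largest `top_fraction` of clusters,
    when cluster sizes are proportional to rank**(-b)."""
    sizes = [r ** (-b) for r in range(1, n_clusters + 1)]
    n_top = int(n_clusters * top_fraction)
    return sum(sizes[:n_top]) / sum(sizes)


# Discrete sum over a finite number of clusters: roughly one half.
print(round(fraction_in_top_clusters(21510, 0.10, 0.7), 2))

# Continuous approximation: the integral of r**(-b) grows like r**(1 - b).
print(round(0.10 ** (1 - 0.7), 2))  # 0.5
```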
27General fold classification systems
The definitions of four broad structural classes,
all-α, all-β, α/β, and α+β, based on secondary
structure compositions and β-sheet topologies
(Levitt & Chothia, 1976), represented the first
step towards a global characterization of the
protein fold space. These definitions have been
generally accepted and are being used by many
classification systems to organize the fold
hierarchy (Murzin et al., 1995; Orengo et al.,
1997). However, there is a need for methods to
represent the full range of structural
relationships among folds for a better
understanding of the organizing principles and
features of the protein fold space.
28General fold classification systems (cont.)
The fold family trees such as those built by
Efimov (1997), Zhang and Kim (2000) and Taylor
(2002) are very informative, but the construction
of such trees involves extensive manual
operations and, sometimes, considerable human
judgment. An alternative approach is to apply a
uniform measure of structural similarity
across all fold types and map the structural
relationships into a low-dimensional space. Two
such maps have been introduced: one is
represented in the CATH database by Orengo and
colleagues (1997) and the other in the DALI
database by Holm and Sander (1993). Although the
two maps are based on different structural
alignment algorithms and multivariate analysis
methods, they give similar two-dimensional
projections featuring three large clusters
corresponding to α, β, and α/β folds,
respectively.
29General fold classification system references
Levitt, M. and C. Chothia, Structural patterns in globular proteins. Nature, 1976. 261(5561): p. 552-8.
Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.
Orengo, C.A., et al., CATH--a hierarchic classification of protein domain structures. Structure, 1997. 5(8): p. 1093-108.
Taylor, W.R., A 'periodic table' for protein structures. Nature, 2002. 416(6881): p. 657-60.
Orengo, C.A., et al., Identification and classification of protein fold families. Protein Eng, 1993. 6(5): p. 485-500.
30General fold classification system references
(cont.)
Efimov, A.V., Structural trees for protein superfamilies. Proteins, 1997. 28(2): p. 241-60.
Zhang, C. and S.H. Kim, A comprehensive analysis of the Greek key motifs in protein beta-barrels and beta-sandwiches. Proteins, 2000. 40(3): p. 409-19.
Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices. J Mol Biol, 1993. 233(1): p. 123-38.
31Fold distribution
A metric matrix distance geometry method was applied to
all pair-wise distances (structural
dissimilarities) to assign three-dimensional
coordinates to a set of 498 SCOP folds, such that
the relative distance between two folds is
inversely correlated with the DALI alignment
score. The results of the mapping are shown in
the figure on the left.
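The general idea of metric matrix distance geometry can be sketched as follows: square the distances, double-centre them into a metric (Gram) matrix, and use the eigenvectors with the largest eigenvalues, scaled by the square roots of those eigenvalues, as coordinates. This is a minimal pure-Python illustration, not the procedure used for the 498 folds: the conversion from DALI scores to distances is not given on the slides, the function names are made up, and the simple power iteration used here assumes the leading eigenvalues are positive and well separated.

```python
import math


def gram_from_distances(d):
    """Double-centre the squared distances to obtain the metric (Gram) matrix."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpairs(matrix, k, iterations=500):
    """Largest k eigenpairs of a symmetric matrix: power iteration with deflation."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for idx in range(k):
        v = [1.0 + 0.1 * ((i + idx) % n) for i in range(n)]  # arbitrary start vector
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left in this direction
                break
            v = [x / norm for x in w]
        mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * mv[i] for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


def embed(d, dims=3):
    """Assign `dims`-dimensional coordinates that reproduce the distances in d."""
    pairs = top_eigenpairs(gram_from_distances(d), dims)
    return [[math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
            for i in range(len(d))]


# Toy example: distances between five made-up points in 3D.
points = [(0, 0, 0), (10, 0, 0), (0, 3, 0), (0, 0, 1), (10, 3, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = embed(dist, dims=3)
print(round(math.dist(coords[0], coords[1]), 3))  # 10.0
```

The recovered coordinates match the original points only up to rotation and reflection, which is all that a distance-based map can determine.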
32The first 20 eigenvalues of the metric matrix
calculated from the 498 × 498 DALI structural
alignment scores.
33Plotting the first 3 eigenvectors, i.e., the
eigenvectors corresponding to the three largest
eigenvalues. Again, notice the segregation of the
four main structural classes.
34The same as the preceding slide, but from another
angle
35Comparing fold usage between two species in the
eubacterial domain (Chlamydia versus Aquifex, A)
and between those of two different domains
(Chlamydia of bacteria versus Halobacterium of
archaea, B). The usages of the 498 folds by the
second organism are subtracted from the fold
usages by the first organism. A contour surface
(mesh) is then constructed and set at the values
of +0.4 for blue and -0.4 for red. Regions
within the blue contour include folds that appear
more frequently in the first organism, whereas
regions within the red contour include folds that
occur more frequently in the second organism.
36CATH database
Class, Architecture, Topology, Homologous
superfamily
37CATH database
38Structural Classification of proteins (SCOP)
database
- All alpha proteins
- All beta proteins
- Alpha and beta proteins (a/b): Mainly parallel beta sheets (beta-alpha-beta units)
- Alpha and beta proteins (a+b): Mainly antiparallel beta sheets (segregated alpha and beta regions)
- Multi-domain proteins (alpha and beta): Folds consisting of two or more domains belonging to different classes
- Membrane and cell surface proteins and peptides: Does not include proteins in the immune system
39Structural Classification of proteins (SCOP)
database (cont.)
- Small proteins: Usually dominated by metal ligand, heme, and/or disulfide bridges
- Coiled coil proteins: Not a true class
- Low resolution structures: Not a true class
- Peptides: Peptides and fragments. Not a true class
- Designed proteins: Experimental structures of proteins with essentially non-natural sequences. Not a true class
40SCOP
- Gold standard of protein classification
- In essence, the work of a single man (Alexei Murzin)
- The classification has been constructed manually by visual inspection and comparison of structures, but with the assistance of tools to make the task manageable and help provide generality.
41SCOP
- The different major levels in the hierarchy are:
- Family: Clear evolutionary relationship. Proteins
clustered together into families are clearly
evolutionarily related. Generally, this means
that pairwise residue identities between the
proteins are 30% and greater. However, in some
cases similar functions and structures provide
definitive evidence of common descent in the
absence of high sequence identity; for example,
many globins form a family though some members
have sequence identities of only 15%.
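The 30% figure refers to pairwise residue identity over an alignment. A minimal sketch of that calculation is given below, assuming the two sequences are already aligned, gaps are written as "-", and identity is counted over columns where both sequences have a residue (other definitions of the denominator exist). The function names and the toy sequences are made up; SCOP families are assigned by expert judgment, not by this threshold alone.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over columns with no gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def meets_family_rule_of_thumb(aligned_a, aligned_b, threshold=30.0):
    """True if the pair reaches the rule-of-thumb identity level quoted above."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Toy aligned fragments (not real protein sequences).
print(percent_identity("ACDEFGHIKL", "ACDEFGHIKL"))          # 100.0
print(meets_family_rule_of_thumb("ACDE-FGH", "ACDQYFGA"))   # True
```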
42SCOP
- The different major levels in the hierarchy are:
- Superfamily: Probable common evolutionary
origin. Proteins that have low sequence
identities, but whose structural and functional
features suggest that a common evolutionary
origin is probable, are placed together in
superfamilies. For example, actin, the ATPase
domain of the heat shock protein, and hexokinase
together form a superfamily.
43SCOP
- The different major levels in the hierarchy are:
- Fold: Major structural similarity. Proteins are
defined as having a common fold if they have the
same major secondary structures in the same
arrangement and with the same topological
connections. Different proteins with the same
fold often have peripheral elements of secondary
structure and turn regions that differ in size
and conformation. In some cases, these differing
peripheral regions may comprise half the
structure. Proteins placed together in the same
fold category may not have a common evolutionary
origin: the structural similarities could arise
just from the physics and chemistry of proteins
favouring certain packing arrangements and chain
topologies.
44DALI database
- Based upon the DALI method for structural superpositioning. The programme optimises the overlay of distance plots (see next slide)
- Fully automatic
- Database contains clusters of protein families (e.g. a giant PDB structures tree) and structural alignments
- Database is consistent, but grouping is not done manually by experts
45DALI database: Contact Maps
Fig (c): contact map of ROP (lower triangle) and 256B
(upper triangle). Fig (d): collapsed ROP
(lower triangle) and difference contact plot (upper
triangle)
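Distance plots and contact maps of the kind shown in these figures are derived from atomic coordinates. The sketch below builds a distance matrix, a binary contact map and a difference plot from lists of 3D points (for example one point per residue). The 8 Å cutoff and the function names are assumptions for illustration; this only constructs the plots and does not implement DALI's alignment of distance matrices.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between residue positions."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_map(coords, cutoff=8.0):
    """1 where two residues lie within `cutoff` (assumed 8 Angstrom), else 0."""
    return [[1 if dij <= cutoff else 0 for dij in row]
            for row in distance_matrix(coords)]


def difference_plot(coords_a, coords_b):
    """Absolute difference between two distance matrices of equal size."""
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    if len(da) != len(db):
        raise ValueError("structures must have the same number of residues")
    return [[abs(x - y) for x, y in zip(ra, rb)] for ra, rb in zip(da, db)]


# Toy chain of four positions along a line (not a real structure).
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(chain):
    print(row)
```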
46PROTOMAP database (Linial et al.)
- Number of proteins in DB (May 2000) is 365174 (341645 after merging identical entries), number of clusters is 18140, number of singletons is 43219 (of which 14384 are satellites of other clusters)
- Provides software to group new protein sequences
- Fully automatic
- Classifies UniProt TrEMBL (translated EMBL) databases
47Folds: how many?
- Chothia (1992): approx. 1,000 folds
- Estimates vary from 1,000 to 15,000
- With 30,000 human genes, 3 genes per fold on average (but think about alternative splicing)
Chothia, C., Proteins. One thousand families for the molecular biologist. Nature, 1992. 357(6379): p. 543-4.
Zhang, C. and C. DeLisi, Estimating the number of protein folds. J Mol Biol, 1998. 284(5): p. 1301-5.