Bioinformatics master course DNA/Protein structure-function analysis and prediction Lecture 5: Protein Fold Families


Title: PowerPoint Presentation Author: heringa Last modified by: heringa Created Date: 2/3/2003 6:56:45 PM Document presentation format: On-screen Show – PowerPoint PPT presentation


1
Bioinformatics master course
DNA/Protein structure-function analysis and prediction
Lecture 5: Protein Fold Families

Centre for Integrative Bioinformatics VU (IBIVU)
Faculty of Sciences / Faculty of Earth and Life Sciences

2
Protein structure evolution
Insertion/deletion of secondary structural
elements can easily be done at loop sites
3
Protein structure evolution
Insertion/deletion of structural domains can
easily be done at loop sites
4
Fold classification

Four broad structural protein fold classes:
all-α
all-β
α/β (α mixed with β)
α+β (separated α and β regions)

5
The first protein structure (1960): myoglobin,
an α fold
6
There are a number of examples of small proteins
(or peptides) which consist of little more than a
single helix. A striking example is alamethicin,
a transmembrane voltage-gated ion channel, acting
as a peptide antibiotic.
7
Coiled-coil domains
This long protein is involved in muscle
contraction
Tropomyosin
8
Alpha-helix interaction
The interface areas of two packed helices should have
complementary surfaces. The α-helix surface can be
thought of as consisting of grooves and ridges,
like a screw thread. For instance, the side
chains of every 4th residue form an "i+4" ridge
(because there are 3.6 residues per turn). The
direction of this ridge is 26° from the direction
of the helix axis. Therefore, if two helices pack
such that a ridge from each fits into the
other's groove, the expected angle between the
two is 52°. In fact, in the observed distribution
of this angle between packed α-helices, there
is a sharp peak at 50°. Ridges can also be
formed by other stacking patterns of residues,
such as every 3rd residue, or indeed every
residue. The "i+4" ridge is believed to be the
most common because residues at every 4th
position have side chains which are more closely
aligned than in "i+3" or "i+1" ridges, as
indicated below.
http://swissmodel.expasy.org/course/text/chapter4.htm
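The ridge geometry can be sketched numerically. The snippet below is only an illustration: it uses the standard α-helix values of 3.6 residues per turn and a 1.5 Å rise per residue, and it assumes an effective side-chain radius of 4.3 Å, a value chosen for this example and not taken from the slides. The computed angle is sensitive to that radius.

```python
import math

RESIDUES_PER_TURN = 3.6   # standard alpha-helix geometry
RISE_PER_RESIDUE = 1.5    # angstroms travelled along the helix axis per residue
SIDE_CHAIN_RADIUS = 4.3   # ASSUMPTION: illustrative effective radius in angstroms


def ridge_angle(step, radius=SIDE_CHAIN_RADIUS):
    """Angle in degrees between the helix axis and the ridge joining residues i and i+step."""
    turn = 360.0 / RESIDUES_PER_TURN * step      # total rotation over `step` residues
    offset = (turn + 180.0) % 360.0 - 180.0      # wrap the rotation to [-180, 180)
    arc = radius * math.radians(offset)          # displacement around the helix surface
    rise = RISE_PER_RESIDUE * step               # displacement along the helix axis
    return math.degrees(math.atan2(arc, rise))


def packing_angle(step, radius=SIDE_CHAIN_RADIUS):
    """Inter-axis angle when ridges of the same type from two helices interdigitate."""
    return 2.0 * abs(ridge_angle(step, radius))


i4_ridge = ridge_angle(4)       # close to the 26 degrees quoted on the slide
i4_packing = packing_angle(4)   # close to the 52 degrees quoted on the slide
```

With this radius the i+4 ridge comes out near the quoted 26°, and the i+3 ridge tilts the other way by a larger angle; a different assumed radius shifts both values.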
9
Helix-turn-helix and 4-helix bundles
Here is a diagram of Interleukin-2, human Growth
Hormone, Granulocyte-macrophage
colony-stimulating factor (GM-CSF) and
Interleukin-4.
10
Beta-proteins
11
Beta-sheet structures
porin
12
Greek key β-strand motif
13
Greek key β-strand motif
Structure gamma-crystallin
14
α/β fold
Flavodoxin fold
5(βα) fold
15
α/β fold
Flavodoxin family - TOPS diagrams (Flores et
al., 1994)
16
Beta-alpha-beta structures
17
Alpha-beta barrel
18
Plait motif
19
αβ 3-layer motifs (2 layers of helices with a
β-sheet in between) are often specified as
x-y-z (e.g. 4-14-5), where x is the number of
helices in the first helical layer, y is the number
of strands in the β-sheet, and z is the number of
helices in the second helical layer
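As a small illustration of the x-y-z notation, the label can be split into its three counts. The names ThreeLayer and parse_three_layer are hypothetical helpers introduced here, not part of any classification tool.

```python
from typing import NamedTuple


class ThreeLayer(NamedTuple):
    """Counts encoded by an x-y-z label for a 3-layer motif."""
    helices_first: int    # x: helices in the first helical layer
    strands: int          # y: strands in the central beta-sheet
    helices_second: int   # z: helices in the second helical layer


def parse_three_layer(label):
    """Parse a label such as '4-14-5' into its three integer counts."""
    x, y, z = (int(part) for part in label.split("-"))
    return ThreeLayer(x, y, z)


example = parse_three_layer("4-14-5")
```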
20
For ? proteins, there are no good classification
systems. You can only count
21
How many folds? Chothia (1992)
The first explicit estimate of the number of protein
families was made by Chothia in
1992. At that time about 120 structural families
were known. Chothia summarized the results of
several genome projects and found that the
chance that a random protein belongs to one of
the known sequence families is approximately 1/3.
According to a sequence comparison
of the PDB with sequence databases (Sander &
Schneider, 1991), about 1/4 of all sequences
appeared to be similar to one of the PDB entries
at the 25% identity level. Assuming an equal
distribution of proteins among the families,
Chothia concluded that the total number of
protein structural families should be
120 × 3 × 4 = 1440.
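The arithmetic behind this estimate can be written out as a short sketch. The function name and parameter names are made up for this example; the numbers are the ones quoted on the slide.

```python
from fractions import Fraction


def estimate_total_families(known_structural_families,
                            frac_in_known_sequence_families,
                            frac_similar_to_pdb):
    """Scale the known family count by the fraction of proteins it covers.

    Assumes, as the slide does, that proteins are spread equally over families.
    """
    covered = frac_in_known_sequence_families * frac_similar_to_pdb
    return known_structural_families / covered


# 120 known families, 1/3 of proteins in known sequence families,
# 1/4 of those similar to a PDB entry.
total = estimate_total_families(120, Fraction(1, 3), Fraction(1, 4))
```

The equal-distribution assumption is the weak point of the estimate, which is why later estimates based on the observed family-size distribution arrive at different numbers.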
22
How many folds? Alexandrov & Go (1994), updated
The Pfam-2.1 database consists of 101,724 domains of
proteins from SwissProt (Bairoch R., 1996)
release 34, clustered in 13,816 families. There
were also 7,694 proteins of 30 or more amino
acids in SwissProt-34 that are not present in
Pfam and are not similar to other proteins. We
have added them to the database, which now
contains 109,418 domains in 21,510 families. We
then eliminated very similar sequences from the
database to make it more
homogeneous. In the final classification there
were 60,601 domains, distributed within 21,510
families. All families were ranked by the number
of sequences in each family. The resulting
distribution fits Zipf's law well
(http://wwww.bionet.nsc.ru/bgrs/thesis/100/)
23
How many folds?
r is the rank of the family, n(r) is the number of
proteins in the r-th family, a is a scaling
constant that depends on the number of proteins in
the dataset, and b = 0.64. The constant b does not
depend on the size of the dataset.
n(r) = a r^{-b}
24
How many folds (cont.)
Distribution of protein sequences among protein
families. One can see that the distribution is
far from uniform. The shape of the
distribution is described very well by Zipf's
law, n(r) = a r^{-b}, with a = 640 and b = 0.64. The
correlation coefficient of this approximation
is 0.992.
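One common way to obtain a and b from ranked family sizes is a straight-line fit of log n(r) against log r. The sketch below assumes that procedure; the slides do not say how the fit was actually done, and the data here are synthetic values generated from the quoted parameters, not the Pfam/SwissProt counts.

```python
import math


def fit_zipf(family_sizes):
    """Fit n(r) = a * r**(-b) by least squares on log n(r) versus log r.

    Returns (a, b, corr), where corr is the Pearson correlation of the
    log-log points (negative for a decaying law).
    """
    sizes = sorted(family_sizes, reverse=True)          # rank 1 = largest family
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope, sxy / math.sqrt(sxx * syy)


# Synthetic family sizes that follow the law exactly (hypothetical data).
synthetic_sizes = [640 * rank ** -0.64 for rank in range(1, 1001)]
a_fit, b_fit, corr = fit_zipf(synthetic_sizes)
```

On real counts the points scatter around the line, so the magnitude of the correlation falls below 1, as with the 0.992 reported above.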
25
Fold number according to Alexandrov & Go
60,000 protein sequence families in 14,000
different folds
26
Fold number according to Alexandrov & Go
An important feature of Zipf's distribution is
that it has a very long tail of clusters with
only a few members each. For example, if b = 0.7,
half of all proteins are located in 10% of all
clusters.
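The long-tail claim can be checked with a quick calculation. The sketch below assumes cluster sizes proportional to r^{-b} and an arbitrary total of 20,000 clusters; both the function name and the cluster count are choices made for this example.

```python
def share_in_top_clusters(n_clusters, b, top_fraction):
    """Fraction of all members found in the largest `top_fraction` of clusters,
    assuming the cluster at rank r has size proportional to r**(-b)."""
    sizes = [rank ** -b for rank in range(1, n_clusters + 1)]
    n_top = round(n_clusters * top_fraction)
    return sum(sizes[:n_top]) / sum(sizes)


# Share of proteins in the largest 10% of clusters for b = 0.7.
share = share_in_top_clusters(20000, 0.7, 0.10)
```

For large cluster counts the share approaches 0.1^{0.3}, which is about one half, in line with the statement on the slide.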
27
General fold classification systems
The definitions of four broad structural classes,
all-α, all-β, α/β, and α+β, based on secondary
structure compositions and β-sheet topologies
[Levitt & Chothia, 1976], represented the first
step towards a global characterization of the
protein fold space. These definitions have been
generally accepted and are used by many
classification systems to organize the fold
hierarchy [Murzin et al., 1995; Orengo et al.,
1997]. However, there is a need for methods to
represent the full range of structural
relationships among folds for a better
understanding of the organizing principles and
features of the protein fold space.
28
General fold classification systems (cont.)
The fold family trees such as those built by
Efimov [1997], Zhang and Kim [2000] and Taylor
[2002] are very informative, but the construction
of such trees involves extensive manual
operations and, sometimes, considerable human
judgment. An alternative approach is to apply a
uniform measure of structural similarity
across all fold types and map the structural
relationships into a low-dimensional space. Two
such maps have been introduced: one is
represented in the CATH database by Orengo and
colleagues [1997] and the other in the DALI
database by Holm and Sander [1993]. Although the
two maps are based on different structural
alignment algorithms and multivariate analysis
methods, they give similar two-dimensional
projections featuring three large clusters
corresponding to α, β, and α/β folds,
respectively.
29
General fold classification system references
Levitt, M. and C. Chothia, Structural patterns in
globular proteins. Nature, 1976. 261(5561): p. 552-8.
Murzin, A.G., et al., SCOP: a structural
classification of proteins database for the
investigation of sequences and structures. J Mol
Biol, 1995. 247(4): p. 536-40.
Orengo, C.A., et al., CATH--a hierarchic classification
of protein domain structures. Structure, 1997. 5(8):
p. 1093-108.
Taylor, W.R., A 'periodic table' for
protein structures. Nature, 2002. 416(6881): p. 657-60.
Orengo, C.A., et al., Identification and
classification of protein fold families. Protein
Eng, 1993. 6(5): p. 485-500.
30
General fold classification system references
(cont.)
Efimov, A.V., Structural trees for protein
superfamilies. Proteins, 1997. 28(2): p. 241-60.
Zhang, C. and S.H. Kim, A comprehensive
analysis of the Greek key motifs in protein
beta-barrels and beta-sandwiches. Proteins, 2000.
40(3): p. 409-19.
Holm, L. and C. Sander,
Protein structure comparison by alignment of
distance matrices. J Mol Biol, 1993. 233(1): p. 123-38.
31
Fold distribution
A metric matrix distance geometry method was applied to
all pairwise distances (structural
dissimilarities) to assign three-dimensional
coordinates to a set of 498 SCOP folds, such that
the relative distance between two folds is
inversely correlated with their DALI alignment
score. The results of the mapping are shown in
the figure on the left.
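The general idea of metric matrix embedding can be sketched as follows. This is a minimal illustration on four toy points, not the procedure used for the 498 folds: the conversion from DALI scores to dissimilarities is not given in the slides and is not shown, and the power-iteration eigen-solver is a simple stand-in chosen to keep the example dependency-free.

```python
import math
import random


def metric_matrix(dist):
    """Metric (Gram) matrix from pairwise distances, by double centring of d^2."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpairs(matrix, k, iterations=500, seed=0):
    """Largest k eigenpairs of a symmetric matrix by power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        vec = [rng.random() + 0.5 for _ in range(n)]    # arbitrary start vector
        for _ in range(iterations):
            nxt = [sum(w * v for w, v in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        value = sum(v * sum(w * u for w, u in zip(row, vec))
                    for v, row in zip(vec, work))
        pairs.append((value, vec))
        # Deflate: remove the component just found before the next round.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return pairs


def embed(dist, dims=3):
    """Coordinates whose Euclidean distances approximate `dist`."""
    pairs = top_eigenpairs(metric_matrix(dist), dims)
    return [[math.sqrt(max(value, 0.0)) * vec[i] for value, vec in pairs]
            for i in range(len(dist))]


# Hypothetical toy data: corners of a 4 x 2 rectangle.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
toy_dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]
toy_coords = embed(toy_dist, dims=2)
```

The eigenvalues returned by top_eigenpairs indicate how much of the structure each axis captures, which is why the eigenvalue spectrum is inspected before choosing the number of dimensions to plot.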
32
The first 20 eigenvalues of the metric matrix
calculated from the 498 × 498 DALI structural
alignment scores.
33
Plot of the first 3 eigenvectors, i.e., the
eigenvectors corresponding to the three largest
eigenvalues. Again, notice the segregation of the
four main structural classes.
34
The same as the preceding slide, but from another
angle
35
Comparing fold usage between two species in the
eubacterial domain (A: Chlamydia versus Aquifex)
and between species of two different domains
(B: Chlamydia of bacteria versus Halobacterium of
archaea). The usages of the 498 folds by the
second organism are subtracted from the fold
usages by the first organism. A contour surface
(mesh) is then constructed and set at the values
of +0.4 for blue and -0.4 for red. Regions
within the blue contour include folds that appear
more frequently in the first organism, whereas
regions within the red contour include folds that
occur more frequently in the second organism.
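The subtraction step can be illustrated per fold. This sketch is a simplification: the slide draws contour surfaces in the three-dimensional fold map, whereas the code below only thresholds individual differences. The fold names and usage values are hypothetical, and the 0.4 threshold is borrowed from the contour levels without knowing their units.

```python
def usage_difference(first, second):
    """Fold usage of the first organism minus that of the second, per fold."""
    folds = set(first) | set(second)
    return {fold: first.get(fold, 0.0) - second.get(fold, 0.0) for fold in folds}


def split_by_threshold(diff, threshold=0.4):
    """Folds used more by the first organism ('blue') or by the second ('red')."""
    blue = sorted(fold for fold, d in diff.items() if d >= threshold)
    red = sorted(fold for fold, d in diff.items() if d <= -threshold)
    return blue, red


# Hypothetical usage values for three made-up fold labels.
organism_1 = {"fold_A": 1.0, "fold_B": 0.25, "fold_C": 0.5}
organism_2 = {"fold_A": 0.25, "fold_B": 1.0, "fold_C": 0.5}
blue_folds, red_folds = split_by_threshold(usage_difference(organism_1, organism_2))
```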
36
CATH database
Class, Architecture, Topology, Homologous
superfamily
37
CATH database
38
Structural Classification of proteins (SCOP)
database

All alpha proteins
All beta proteins
Alpha and beta proteins (a/b) - Mainly parallel
beta sheets (beta-alpha-beta units)
Alpha and beta proteins (a+b) - Mainly
antiparallel beta sheets (segregated alpha
and beta regions)
Multi-domain proteins (alpha and beta) - Folds
consisting of two or more domains belonging to
different classes
Membrane and cell surface proteins and peptides -
Does not include proteins in the immune system

39
Structural Classification of proteins (SCOP)
database (cont.)

Small proteins - Usually dominated by metal
ligand, heme, and/or disulfide bridges
Coiled coil proteins - Not a true class
Low resolution structures - Not a true class
Peptides - Peptides and fragments. Not a true
class
Designed proteins - Experimental structures of
proteins with essentially non-natural
sequences. Not a true class

40
SCOP

Gold standard of protein classification
In essence, the work of a single man (Alexei
Murzin)
The classification has been constructed manually
by visual inspection and comparison of
structures, but with the assistance of tools to
make the task manageable and help provide
generality.

41
SCOP

The different major levels in the hierarchy are:
Family: Clear evolutionary relationship. Proteins
clustered together into families are clearly
evolutionarily related. Generally, this means
that pairwise residue identities between the
proteins are 30% and greater. However, in some
cases similar functions and structures provide
definitive evidence of common descent in the
absence of high sequence identity; for example,
many globins form a family though some members
have sequence identities of only 15%.
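Pairwise residue identity can be computed from two aligned sequences as in the sketch below. Definitions of identity differ in their denominator; this example assumes identity is counted over columns where neither sequence has a gap. The sequences are toy strings, not real proteins.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue                      # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def passes_identity_rule(aligned_a, aligned_b, threshold=30.0):
    """True when the pair meets the 30% identity guideline for a family."""
    return percent_identity(aligned_a, aligned_b) >= threshold


toy_identity = percent_identity("ACDE-FG", "ACDQHFG")   # 5 matches in 6 columns
```

As the slide notes, the threshold is only a guideline: pairs below it can still be placed in one family when structure and function give clear evidence of common descent.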

42
SCOP

The different major levels in the hierarchy are:
Superfamily: Probable common evolutionary
origin. Proteins that have low sequence
identities, but whose structural and functional
features suggest that a common evolutionary
origin is probable, are placed together in
superfamilies. For example, actin, the ATPase
domain of the heat shock protein, and hexokinase
together form a superfamily.

43
SCOP

The different major levels in the hierarchy are:
Fold: Major structural similarity. Proteins are
defined as having a common fold if they have the
same major secondary structures in the same
arrangement and with the same topological
connections. Different proteins with the same
fold often have peripheral elements of secondary
structure and turn regions that differ in size
and conformation. In some cases, these differing
peripheral regions may comprise half the
structure. Proteins placed together in the same
fold category may not have a common evolutionary
origin; the structural similarities could arise
just from the physics and chemistry of proteins
favouring certain packing arrangements and chain
topologies.

44
DALI database

Based upon the DALI method for structural
superpositioning. The programme optimises the
overlay of distance plots (see next slide)
Fully automatic
Database contains clusters of protein families
(e.g. a giant PDB structures tree) and structural
alignments
Database is consistent, but grouping is not done
manually by experts

45
DALI database: Contact Maps
Fig (c): contact map of ROP (lower triangle) and 256B
(upper triangle). Fig (d): collapsed ROP
(lower triangle) and difference contact plot (upper
triangle)
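The distance plots and contact maps that this kind of comparison starts from can be sketched as below. This only builds the matrices; it does not implement the DALI search for matching sub-matrices. The 8 Å contact cutoff and the toy coordinates are assumptions made for this example.

```python
import math


def distance_matrix(ca_coords):
    """All pairwise distances between C-alpha positions (a distance plot)."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]


def contact_map(dist, cutoff=8.0):
    """Boolean matrix marking residue pairs closer than `cutoff` angstroms."""
    return [[d <= cutoff for d in row] for row in dist]


def difference_plot(dist_a, dist_b):
    """Element-wise absolute difference of two equally sized distance plots."""
    return [[abs(x - y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(dist_a, dist_b)]


# Hypothetical C-alpha coordinates in angstroms, not taken from any PDB entry.
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_distances = distance_matrix(toy_chain)
toy_contacts = contact_map(toy_distances)
```

Two structures with the same fold give similar patterns in these matrices even when their sequences differ, so small values in the difference plot indicate structurally equivalent regions.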
46
PROTOMAP database (Linial et al.)

Number of proteins in DB (May 2000) is 365,174
(341,645 after merging identical entries), number
of clusters is 18,140, number of singletons is
43,219 (of which 14,384 are satellites of other
clusters)
Provides software to group new protein sequences
Fully automatic
Classifies UniProt TrEMBL (translated EMBL)
databases

47
Folds: how many?

Chothia (1992): approx. 1,000 folds
Estimates vary from 1,000 to 15,000
With 30,000 human genes, 3 genes per fold on
average (but think about alternative splicing)

Chothia, C., Proteins. One thousand families for
the molecular biologist. Nature, 1992. 357(6379):
p. 543-4.
Zhang, C. and C. DeLisi, Estimating the
number of protein folds. J Mol Biol, 1998.
284(5): p. 1301-5.
