Protein domains, function and associated prediction presentation

Title: Protein domains, function and associated prediction

1
Lecture 14
Protein domains, function and associated
prediction
Introduction to Bioinformatics
2
Metabolomics fluxomics
3
Experimental

Structural genomics
Functional genomics
Protein-protein interaction
Metabolic pathways
Expression data

4
Issue when elucidating function experimentally

Typically done through knock-out experiments
Partial information (indirect interactions) and
subsequent filling of the missing steps
Negative results (elements that have been shown
not to interact, enzymes missing in an organism)
Putative interactions resulting from
computational analyses

5
Protein function categories

Catalysis (enzymes)
Binding transport (active/passive)
Protein-DNA/RNA binding (e.g. histones,
transcription factors)
Protein-protein interactions (e.g.
antibody-lysozyme) (experimentally determined by
yeast two-hybrid (Y2H) or bacterial two-hybrid
(B2H) screening )
Protein-fatty acid binding (e.g. apolipoproteins)
Protein-small molecule binding (drug interaction,
structure decoding)
Structural component (e.g. ?-crystallin)
Regulation
Signalling
Transcription regulation
Immune system
Motor proteins (actin/myosin)

6
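Catalytic properties of enzymes
Michaelis-Menten equation
$V = \frac{V_{max}[S]}{K_m + [S]}$

$E + S \overset{K_m}{\rightleftharpoons} ES \overset{k_{cat}}{\rightarrow} E + P$
E = enzyme
S = substrate
ES = enzyme-substrate complex (transition state)
P = product
Km = Michaelis constant
kcat = catalytic rate constant (turnover number)
kcat/Km = specificity constant (useful for
comparison)

[Plot: reaction rate V (moles/s) against substrate concentration [S]; the curve levels off at Vmax and passes through Vmax/2 at [S] = Km]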
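As a quick numerical illustration of the Michaelis-Menten relation on this slide, the following Python sketch evaluates the rate at a few substrate concentrations. The function name and the parameter values are made up for illustration and are not taken from the lecture.

```python
def michaelis_menten_rate(s, vmax, km):
    """Reaction rate V = Vmax*[S] / (Km + [S])."""
    if s < 0 or km <= 0:
        raise ValueError("need [S] >= 0 and Km > 0")
    return vmax * s / (km + s)


# Illustrative (made-up) parameters: Vmax in moles/s, Km in the units of [S].
VMAX, KM = 10.0, 2.0
for s in (0.0, KM, 10 * KM, 1000 * KM):
    print(f"[S] = {s:8.1f}   V = {michaelis_menten_rate(s, VMAX, KM):.3f}")
```

At [S] = Km the rate is exactly half of Vmax, and at very high [S] it approaches Vmax without reaching it.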
7
Protein interaction domains
http://pawsonlab.mshri.on.ca/html/domains.html
8
Energy difference upon binding

Examples of protein interactions (and of
functional importance) include
Protein-protein (pathway analysis)
Protein-small molecule (drug interaction,
structure decoding)
Protein-peptide, protein-DNA/RNA
The change in Gibbs Free Energy of the
protein-ligand binding interaction can be
monitored and expressed by the following
equation
$\Delta G = \Delta H - T \Delta S$
(H = Enthalpy, S = Entropy and T = Temperature)
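A small worked example of the Gibbs free energy relation may help. The sketch below is only an illustration: the function name, the units and the numerical values are assumptions, not measured binding data.

```python
def gibbs_free_energy(delta_h, temperature, delta_s):
    """Return dG = dH - T*dS (dH in kJ/mol, T in K, dS in kJ/(mol*K))."""
    return delta_h - temperature * delta_s


# Made-up values for an enthalpy-driven binding event at 298 K.
dg = gibbs_free_energy(delta_h=-50.0, temperature=298.0, delta_s=-0.1)
verdict = "favourable" if dg < 0 else "unfavourable"
print(f"dG = {dg:.1f} kJ/mol ({verdict})")
```

Here the entropy term works against binding, but the enthalpy gain is larger, so the overall change in free energy is still negative.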

10
Protein function

Many proteins combine functions
Some immunoglobulin structures are thought to
have more than 100 different functions (and
active/binding sites)
Alternative splicing can generate (partially)
alternative structures

11
Protein function Interaction
Active site / binding cleft
Shape complementarity
12
Protein function evolution
Chymotrypsin
... to a more elaborate active site with four
different features, all helping to optimise
proteolysis (cleavage)
From a simple ancestral active site for cutting
protein chains...
Gene duplication has resulted in two-domain
protein
13
Protein function evolution
Chymotrypsin
The active site lies between the two domains. It
consists of residues on the same two loops
(firstly between beta-strands 3 and 4, secondly
between beta strands 5 and 6) of each of the two
barrel domains. Four features of the active site
are indicated in the figure.
The Substrate Specificity Pocket
Main Chain Substrate-binding
The Oxyanion Hole (white)
Catalytic triad
Chymotrypsin cleaves peptides at the carboxyl
side of tyrosine, tryptophan, and phenylalanine
because those three amino acids have large aromatic
side chains that fit the substrate specificity pocket.
14
How to infer function

Experiment
Deduction from sequence
Multiple sequence alignment conservation
patterns
Homology searching
Deduction from structure
Threading
Structure-structure comparison
Homology modelling

15
A domain is a

Compact, semi-independent unit (Richardson,
1981).
Stable unit of a protein structure that can fold
autonomously (Wetlaufer, 1973).
Recurring functional and evolutionary module
(Bork, 1992).
Nature is a tinkerer and not an inventor
(Jacob, 1977).
Smallest unit of function

16
Delineating domains is essential for

Obtaining high resolution structures (X-ray, but
particularly NMR, because of the size of proteins)
Sequence analysis
Multiple sequence alignment methods
Prediction algorithms (SS, Class,
secondary/tertiary structure)
Fold recognition and threading
Elucidating the evolution, structure and function
of a protein family (e.g. Rosetta Stone method)
Structural/functional genomics
Cross genome comparative analysis

17
Domain connectivity
linker
18
Structural domain organisation can be nasty
Pyruvate kinase (phosphotransferase)
β barrel regulatory domain; α/β barrel catalytic
substrate binding domain; α/β nucleotide binding
domain
1 continuous + 2 discontinuous domains
19
Domain size

The size of individual structural domains varies
widely,
from 36 residues in E-selectin to 692 residues in
lipoxygenase-1 (Jones et al., 1998),
the majority (90%) having fewer than 200 residues
(Siddiqui and Barton, 1995)
with an average of about 100 residues (Islam et
al., 1995).
Small domains (less than 40 residues) are often
stabilised by metal ions or disulphide bonds.
Large domains (greater than 300 residues) are
likely to consist of multiple hydrophobic cores
(Garel, 1992).

27
Analysis of chain hydrophobicity in multidomain
proteins
28
Analysis of chain hydrophobicity in multidomain
proteins
29
Domain characteristics

Domains are genetically mobile units, and
multidomain families are found in all three
kingdoms (Archaea, Bacteria and Eukarya)
underlining the finding that Nature is a
tinkerer and not an inventor (Jacob, 1977).
The majority of genomic proteins, 75% in
unicellular organisms and more than 80% in
metazoa, are multidomain proteins created as a
result of gene duplication events (Apic et al.,
2001).
Domains in multidomain structures are likely to
have once existed as independent proteins, and
many domains in eukaryotic multidomain proteins
can be found as independent proteins in
prokaryotes (Davidson et al., 1993).

30
Protein function evolution- Gene (domain)
duplication -
Active site
Chymotrypsin
31
Pyruvate phosphate dikinase

3-domain protein
Two domains catalyse 2-step reaction
A → B → C
Third so-called swivelling domain actively
brings intermediate enzymatic product (B) over
45 Å from one active site to the other

32
The DEATH Domain
Present in a variety of Eukaryotic proteins
involved with cell death.
Six helices enclose a tightly packed hydrophobic
core.
Some DEATH domains form homotypic and
heterotypic dimers.

http://www.mshri.on.ca/pawson
34
Detecting Structural Domains

A structural domain may be detected as a compact,
globular substructure with more interactions
within itself than with the rest of the structure
(Janin and Wodak, 1983).
Therefore, a structural domain can be determined
by two shape characteristics: compactness and
extent of isolation (Tsai and Nussinov, 1997).
Measures of local compactness in proteins have
been used in many of the early methods of domain
assignment (Rossmann et al., 1974; Crippen, 1978;
Rose, 1979; Go, 1978) and in several of the more
recent methods (Holm and Sander, 1994; Islam et
al., 1995; Siddiqui and Barton, 1995; Zehfus,
1997; Taylor, 1999).
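The idea of "more interactions within than between" can be sketched as a contact count. The Python below is a toy illustration only and is not any of the cited methods: it assumes one coordinate per residue, a single continuous cut point, and a hypothetical 8 Å contact cutoff, and it simply picks the cut with the fewest contacts across it.

```python
from math import dist


def contact_counts(coords, boundary, cutoff=8.0):
    """Count contacts within and between two continuous segments.

    coords   : list of (x, y, z) positions, one per residue
    boundary : index of the first residue of the second segment
    cutoff   : distance below which two residues count as in contact
    """
    intra = inter = 0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) <= cutoff:
                if (i < boundary) == (j < boundary):
                    intra += 1  # both residues on the same side of the cut
                else:
                    inter += 1  # contact crosses the proposed boundary
    return intra, inter


def best_split(coords, min_len=2, cutoff=8.0):
    """Boundary with the fewest contacts between the two segments."""
    candidates = range(min_len, len(coords) - min_len + 1)
    return min(candidates, key=lambda b: contact_counts(coords, b, cutoff)[1])


# Toy "protein": two tight clusters of five residues, 50 units apart.
toy = [(float(i), 0.0, 0.0) for i in range(5)]
toy += [(50.0 + i, 0.0, 0.0) for i in range(5)]
print(best_split(toy), contact_counts(toy, 5))
```

Real methods normalise for segment size and handle discontinuous domains, which this sketch deliberately ignores.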

35
Detecting Structural Domains

However, approaches encounter problems when faced
with discontinuous or highly associated domains
and many definitions will require manual
interpretation.
Consequently there are discrepancies between
assignments made by domain databases (Hadley and
Jones, 1999).

36
Detecting Domains using Sequence only

Even more difficult than prediction from
structure!

37
Integrating protein multiple sequence alignment,
secondary and tertiary structure prediction in
order to predict structural domain boundaries in
sequence data
SnapDRAGON

Richard A. George
George R.A. and Heringa, J. (2002) J. Mol. Biol.,
316, 839-851.

38
Protein structure hierarchical levels
39
Protein structure hierarchical levels
40
Protein structure hierarchical levels
41
Protein structure hierarchical levels
42
SNAPDRAGON: Domain boundary prediction protocol
using sequence information alone (Richard George)

1. Input: multiple sequence alignment (MSA) and
predicted secondary structure
2. Generate 100 DRAGON 3D models for the protein
structure associated with the MSA
3. Assign domain boundaries to each of the 3D models
(Taylor, 1999)
4. Sum proposed boundary positions within 100 models
along the length of the sequence, and smooth
boundaries using a weighted window

George R.A. and Heringa J.(2002) SnapDRAGON - a
method to delineate protein structural domains
from sequence data, J. Mol. Biol. 316, 839-851.
43
SnapDragon
Folds generated by Dragon
Multiple alignment
Boundary recognition (Taylor, 1999)
Predicted secondary structure
Summed and Smoothed Boundaries
CCHHHCCEEE
44
SNAPDRAGON: Domain boundary prediction protocol
using sequence information alone (Richard George)

1. Input: multiple sequence alignment (MSA)
Sequence searches using PSI-BLAST (Altschul et
al., 1997)
followed by sequence redundancy filtering using
OBSTRUCT (Heringa et al., 1992)
and alignment by PRALINE (Heringa, 1999)
and predicted secondary structure
PREDATOR secondary structure prediction program

45
Domain prediction using DRAGON
Distance Regularisation Algorithm for Geometry
OptimisatioN
(Aszódi and Taylor, 1994)

Folded protein models based on the requirement
that (conserved) hydrophobic residues cluster
together.
First construct a random high dimensional Cα
distance matrix.
Distance geometry is used to find the 3D
conformation corresponding to a prescribed target
matrix of desired distances between residues.

46
SNAPDRAGON: Domain boundary prediction protocol
using sequence information alone (Richard George)

2. Generate 100 DRAGON (Aszódi and Taylor, 1994)
models for the protein structure associated with
the MSA
DRAGON folds proteins based on the requirement
that (conserved) hydrophobic residues cluster
together
(Predicted) secondary structures are used to
further estimate distances between residues (e.g.
between the first and last residue in a
β-strand).
It first constructs a random high dimensional Cα
(and pseudo Cβ) distance matrix
Distance geometry is used to find the 3D
conformation corresponding to a prescribed matrix
of desired distances between residues (by gradual
inertia projection and based on input MSA and
predicted secondary structure)
DRAGON = Distance Regularisation Algorithm for
Geometry OptimisatioN

47
[Figure: DRAGON flow. Labels: input data (multiple alignment; predicted secondary structure, e.g. CCHHHCCEEE); target matrix (N x N); Cα distance matrix (N x N); 3 x N coordinates; 100 randomised initial matrices give 100 predictions]

The Cα distance matrix is divided into smaller
clusters.
Separately, each cluster is embedded into a local
centroid.
The final predicted structure is generated from
full embedding of the multiple centroids and
their corresponding local structures.

48
Lysozyme 4lzm
PDB
DRAGON
49
Methyltransferase 1sfe
PDB
DRAGON
50
Phosphatase 2hhm-A
PDB
DRAGON
51
Taylor method (1999): DOMAIN-3D

3. Assign domain boundaries to each of the 3D
models (Taylor, 1999)
Easy and clever method
Uses a notion of spin glass theory (disordered
magnetic systems) to delineate domains in a
protein 3D structure
Steps
Take sequence with residue numbers (1..N)
Look at neighbourhood of each residue (first
shell)
If (average neighbourhood residue number > resno)
resno = resno + 1
else resno = resno - 1
If (convergence) then take regions with identical
residue number as domains and terminate

Taylor, W.R. (1999) Protein structural domain
identification. Protein Engineering 12, 203-216.
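The steps above can be sketched in a few lines of Python. This is a toy reading of the slide and not Taylor's published implementation: the distance cutoff, the fixed number of iterations, the rule that a label stays put when it equals the neighbourhood mean, and the grouping of final labels by a maximum gap are all assumptions made for illustration.

```python
from math import dist


def taylor_labels(coords, cutoff=10.0, n_iter=50):
    """Move each residue label one step towards its neighbours' mean label."""
    n = len(coords)
    neighbours = [
        [j for j in range(n) if j != i and dist(coords[i], coords[j]) <= cutoff]
        for i in range(n)
    ]
    labels = [float(i + 1) for i in range(n)]  # start from residue numbers 1..N
    for _ in range(n_iter):
        new = labels[:]
        for i, nbrs in enumerate(neighbours):
            if not nbrs:
                continue
            mean = sum(labels[j] for j in nbrs) / len(nbrs)
            if mean > labels[i]:
                new[i] = labels[i] + 1
            elif mean < labels[i]:
                new[i] = labels[i] - 1
        labels = new  # synchronous update of all residues
    return labels


def domains_from_labels(labels, max_gap=2.0):
    """Give residues the same domain id when their labels lie close together."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    domain = [0] * len(labels)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if labels[nxt] - labels[prev] > max_gap:
            current += 1  # a large jump in label value starts a new domain
        domain[nxt] = current
    return domain


# Toy structure: two compact groups of ten residues, far apart in space.
toy = [(float(i), 0.0, 0.0) for i in range(10)]
toy += [(100.0 + i, 0.0, 0.0) for i in range(10)]
print(domains_from_labels(taylor_labels(toy)))
```

Because labels move in steps of one, they end up oscillating by one unit around a common value rather than becoming strictly identical, which is why the sketch groups labels by proximity instead of testing for equality.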
52
Taylor method (1999)
Repeat until convergence: if 41 < (5+6+56+78+89)/5
then Res 41 → 42 (up 1)
else Res 41 → 40 (down 1)
[Figure: residue 41 with neighbouring residues 5, 6, 56, 78 and 89]
53
Taylor method (1999)
initial situation
Iterate until convergence
continuous
discontinuous
54
SNAPDRAGON: Domain boundary prediction protocol
using sequence information alone (Richard George)

4. Sum proposed boundary positions within 100 models
along the length of the sequence, and smooth
boundaries using a weighted window (assign
central position)
Window score $= \sum_{1 \le i \le l} S_i W_i$
where $W_i = (p - |p - i|)/p^2$ and $p = \frac{1}{2}(n+1)$.
It follows that $\sum_{l} W_i = 1$.

[Plot: window weight W_i against window position i]
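To make the weighted window concrete, here is a minimal Python sketch of the smoothing step. It assumes that the window length n is odd and equals the upper summation limit l, that S_i is the summed boundary count at window position i, and that positions falling outside the sequence contribute zero; the function names are illustrative.

```python
def triangular_weights(n):
    """Weights W_i = (p - |p - i|) / p**2 for i = 1..n, with p = (n + 1) / 2."""
    if n < 1 or n % 2 == 0:
        raise ValueError("window length n must be a positive odd integer")
    p = (n + 1) / 2
    return [(p - abs(p - i)) / p**2 for i in range(1, n + 1)]


def smooth(counts, n=5):
    """Weighted-window score of boundary counts, assigned to the central position."""
    weights = triangular_weights(n)
    half = n // 2
    smoothed = []
    for centre in range(len(counts)):
        score = 0.0
        for k, weight in enumerate(weights):
            pos = centre - half + k
            if 0 <= pos < len(counts):  # outside the sequence counts as zero
                score += counts[pos] * weight
        smoothed.append(score)
    return smoothed


print(sum(triangular_weights(5)))
print(smooth([0, 0, 9, 0, 0], n=5))
```

The weights peak at the centre of the window and sum to one, so a sharp spike of boundary votes is spread over neighbouring positions without changing its total.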
55
SNAPDRAGON

Statistical significance
Convert peak scores to Z-scores using
z = (x - mean)/stdev
If z > 2 then assign domain boundary
Statistical significance using random models
Test hydrophobic collapse given distribution of
hydrophobicity over sequence
Make 5 scrambled multiple alignments (MSAs) and
predict their secondary structure
Make 100 models for each MSA
Compile mean and stdev from the boundary
distribution over the 500 random models
If observed peak z > 2.0 stdev (from random
models) then assign domain boundary
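The Z-score test can be written down directly. The sketch below is an illustration under assumptions: the data structures are hypothetical, the numbers are invented, and the sample standard deviation is used because the slide does not say which estimator was applied.

```python
from statistics import mean, stdev


def significant_boundaries(observed, random_scores, z_cutoff=2.0):
    """Return (position, z) for observed peaks whose Z-score exceeds the cutoff.

    observed      : dict mapping peak position -> smoothed boundary score
    random_scores : boundary scores collected from the random (scrambled) models
    """
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # sample standard deviation (an assumption)
    hits = []
    for pos, score in sorted(observed.items()):
        z = (score - mu) / sigma
        if z > z_cutoff:
            hits.append((pos, z))
    return hits


# Invented numbers: two observed peaks and a small background distribution.
background = [1.0, 2.0, 3.0, 4.0, 5.0]
peaks = {50: 10.0, 120: 4.0}
print(significant_boundaries(peaks, background))
```

Only the peak at position 50 stands out from the background here; the peak at position 120 is within one standard deviation of the random mean and is discarded.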

56
SnapDRAGON prediction assessment

Test set of 414 multiple alignments: 183 single-
and 231 multiple-domain proteins.
Boundary predictions are compared to the region
of the protein connecting two domains (maximally
±10 residues from true boundary)

57
SnapDRAGON prediction assessment

Baseline method I
Divide sequence in equal parts based on number of
domains predicted by SnapDRAGON
Baseline method II
Similar to Wheelan et al., based on domain length
partition density function (PDF)
PDF derived from 2750 non-redundant structures
(deposited at NCBI)
Given sequence, calculate probability of
one-domain, two-domain, .., protein
Highest probability taken and sequence split
equally as in baseline method I
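Baseline method I is simple enough to state as code. This is a sketch with an assumed function name; rounding boundary positions to the nearest residue is also an assumption, since the slide does not say how fractional positions were handled.

```python
def equal_split_boundaries(seq_len, n_domains):
    """Boundary positions that cut a sequence into n_domains equal parts."""
    if n_domains < 1:
        raise ValueError("n_domains must be at least 1")
    return [round(k * seq_len / n_domains) for k in range(1, n_domains)]


print(equal_split_boundaries(300, 3))
print(equal_split_boundaries(150, 1))
```

A single-domain prediction yields no boundaries at all, and baseline method II would differ only in how the number of domains is chosen before the same equal split is applied.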

58
Average prediction results per protein
Coverage is the % of linkers predicted
(TP/(TP+FN)). Success is the % of correct
predictions made (TP/(TP+FP)).
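The two measures can be computed as follows. This sketch simplifies the assessment to single boundary positions with a ±10 residue tolerance, whereas the lecture compares predictions against whole linker regions; the one-to-one matching rule and all names are assumptions.

```python
def score_predictions(predicted, true_boundaries, tolerance=10):
    """Return (coverage, success) for predicted domain boundaries.

    A prediction is a true positive (TP) when it lies within `tolerance`
    residues of a true boundary that has not been matched yet.
    """
    unmatched = list(true_boundaries)
    tp = 0
    for pos in predicted:
        match = next((t for t in unmatched if abs(t - pos) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            tp += 1
    fp = len(predicted) - tp  # predictions with no true boundary nearby
    fn = len(unmatched)       # true boundaries that were never predicted
    coverage = tp / (tp + fn) if tp + fn else 0.0
    success = tp / (tp + fp) if tp + fp else 0.0
    return coverage, success


print(score_predictions([95, 180, 260], [100, 250]))
```

In this invented case both true boundaries are found (full coverage), but one of the three predictions is spurious, so the success rate is two thirds.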
59
Average prediction results per protein
60
Protein-protein interaction networks
61
Protein Function Prediction

How can we get the edges (connections) of the
cellular networks?
We can predict functions of genes or proteins so
we know where they would fit in a metabolic
network
There are also techniques to predict whether two
proteins interact, either functionally (e.g. they
are involved in a two-step metabolic process) or
directly physically (e.g. are together in a
protein complex)

62
Protein Function Prediction
The state of the art: it's not complete. Many
genes are not annotated, and many more are
partially or erroneously annotated. Given a
genome which is partially annotated at best, how
do we fill in the blanks? For each sequenced
genome, the functions of 20-50% of the encoded
proteins remain unknown! How then
do we build reasonably complete networks when
the parts list is so incomplete?
63
Protein Function Prediction
For all these reasons, improving automated
protein function prediction is now a cornerstone
of bioinformatics and computational biology. New
methods will need to integrate signals coming
from sequence, expression, interaction and
structural data, etc.
64
Classes of function prediction methods (recap)

Sequence based approaches
protein A has function X, and protein B is a
homolog (ortholog) of protein A; hence B has
function X
Structure-based approaches
protein A has structure X, and X has so-so
structural features; hence A's function sites are
...
Motif-based approaches
a group of genes have function X and they all
have motif Y; protein A has motif Y; hence
protein A's function might be related to X
Function prediction based on guilt-by-association
gene A has function X and gene B is often
associated with gene A; B might have function
related to X

65
Phylogenetic profile analysis

Function prediction of genes based on
guilt-by-association: a non-homologous
approach
The phylogenetic profile of a protein is a string
that encodes the presence or absence of the
protein in every sequenced genome
Because proteins that participate in a common
structural complex or metabolic pathway are
likely to co-evolve, the phylogenetic profiles of
such proteins are often similar
This means that such proteins have a good chance
of being physically or metabolically connected

66
Phylogenetic profile analysis

Phylogenetic profile (against N genomes)
For each gene X in a target genome (e.g. E.
coli), build a phylogenetic profile as follows:
If gene X has a homolog in genome i, the ith bit
of X's phylogenetic profile is 1; otherwise it
is 0
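The construction rule translates almost word for word into code. In this sketch the homolog table is assumed to have been computed already (for example by a sequence search), and the gene and genome names are placeholders.

```python
def phylogenetic_profile(gene, genomes, homologs):
    """Bit string with a 1 for every genome that contains a homolog of `gene`.

    genomes  : ordered list of genome names (defines the bit positions)
    homologs : dict mapping gene -> set of genome names with a homolog
    """
    present = homologs.get(gene, set())
    return "".join("1" if genome in present else "0" for genome in genomes)


# Placeholder data: gene X has homologs in the first and third genome only.
genomes = ["genome1", "genome2", "genome3", "genome4"]
homologs = {"X": {"genome1", "genome3"}}
print(phylogenetic_profile("X", genomes, homologs))
```

The order of the genome list must be the same for every gene, otherwise the bit positions of different profiles cannot be compared.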

67
Phylogenetic profile analysis

Example phylogenetic profiles based on 60
genomes

gene      profile (one bit per genome)
orf1034   1110110110010111110100010100000000111100011111110110111010101
orf1036   1011110001000001010000010010000000010111101110011011010000101
orf1037   1101100110000001110010000111111001101111101011101111000010100
orf1038   1110100110010010110010011100000101110101101111111111110000101
orf1039   1111111111111111111111111111111111111111101111111111111111101
orf104    1000101000000000000000101000000000110000000000000100101000100
orf1040   1110111111111101111101111100000111111100111111110110111111101
orf1041   1111111111111111110111111111111101111111101111111111111111101
orf1042   1110100101010010010110000100001001111110111110101101100010101
orf1043   1110100110010000010100111100100001111110101111011101000010101
orf1044   1111100111110010010111010111111001111111111111101101100010101
orf1045   1111110110110011111111111111111101111111101111111111110010101
orf1046   0101100000010001011000000111110000010100000001010010100000000
orf1047   0000000000000001000010000001000100000000000000010000000000000
orf105    0110110110100010111101101010111001101100101111100010000010001
orf1054   0100100110000001100001000100000000100100100001000100100000000
By correlating the rows (open reading frames
(ORFs) or genes) you find out about joint presence
or absence of genes: this is a signal for a
functional connection
Genes with similar phylogenetic profiles have
related functions or are functionally linked (D.
Eisenberg and colleagues, 1999)
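One simple way to correlate the rows is to count the genomes in which two profiles differ. The sketch below uses the Hamming distance with a hypothetical threshold and four short invented profiles; it is not necessarily the measure used by Eisenberg and colleagues.

```python
def hamming_distance(p, q):
    """Number of genomes in which two equal-length profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(p, q))


def linked_pairs(profiles, max_distance=1):
    """Pairs of genes whose profiles differ in at most max_distance genomes."""
    genes = sorted(profiles)
    return [
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if hamming_distance(profiles[g], profiles[h]) <= max_distance
    ]


# Invented six-genome profiles for four genes.
profiles = {
    "geneA": "110100",
    "geneB": "110100",
    "geneC": "001011",
    "geneD": "110110",
}
print(linked_pairs(profiles))
print(hamming_distance(profiles["geneA"], profiles["geneC"]))
```

In this toy set geneA, geneB and geneD are candidates for a functional link, while geneC differs from geneA in every genome, which is the kind of inverse profile that suggests functional complementarity.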
68
Phylogenetic profile analysis

Phylogenetic profiles contain a great amount of
functional information
Phylogenetic profile analysis can be used to
distinguish orthologous genes from paralogous
genes
Example: subcellular localization. 361 yeast
nucleus-encoded mitochondrial proteins were
identified at 50% accuracy with 58% coverage
through phylogenetic profile analysis
Functional complementarity: by examining inverse
phylogenetic profiles, one can find functionally
complementary genes that might have evolved
through one of several mechanisms of convergent
evolution.
Phylogenetic profiling typically has low accuracy
(specificity) but can have high coverage.

69
Domain fusion example

Vertebrates have a multi-enzyme protein
(GARs-AIRs-GARt) comprising the enzymes GAR
synthetase (GARs), AIR synthetase (AIRs), and GAR
transformylase (GARt)
In insects, the polypeptide appears as
GARs-(AIRs)2-GARt
In yeast, GARs-AIRs is encoded separately from
GARt
In bacteria each domain is encoded separately
(Henikoff et al., 1997).
GAR glycinamide ribonucleotide
AIR aminoimidazole ribonucleotide

70
Using observed domain fusion for prediction of
protein-protein interactions: the Rosetta Stone
method

Gene fusion is an effective method for
prediction of protein-protein interactions
If proteins A and B are homologous to two domains
of a multi-domain protein C, A and B are
predicted to interact

A
B
C
Though gene-fusion has low prediction coverage,
its false-positive rate is low (high specificity)
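The Rosetta Stone rule can be sketched as a search for two proteins that both match the same fused protein in different regions. The input format below (homology hits given as protein name plus start and end positions) and the requirement that the matched regions do not overlap are assumptions made for this illustration.

```python
from itertools import combinations


def rosetta_stone_pairs(hits):
    """Predict interacting pairs from homology hits to multi-domain proteins.

    hits : dict mapping query protein -> list of (fused_protein, start, end)
           regions of fused_protein that the query is homologous to
    """
    predicted = set()
    for a, b in combinations(sorted(hits), 2):
        for prot_a, start_a, end_a in hits[a]:
            for prot_b, start_b, end_b in hits[b]:
                same_protein = prot_a == prot_b
                no_overlap = end_a < start_b or end_b < start_a
                if same_protein and no_overlap:
                    predicted.add((a, b))  # A and B match different domains of C
    return sorted(predicted)


# Invented hits: A and D match the same N-terminal region of C, B the C-terminal one.
hits = {
    "A": [("C", 1, 200)],
    "B": [("C", 250, 480)],
    "D": [("C", 10, 190)],
}
print(rosetta_stone_pairs(hits))
```

A and D are not paired because they match the same domain of C, which makes them likely homologs of each other rather than interaction partners.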
71
Protein interaction database

There are numerous databases of protein-protein
interactions
DIP is a popular protein-protein interaction
database

The DIP database catalogs experimentally
determined interactions between proteins. It
combines information from a variety of sources to
create a single, consistent set of
protein-protein interactions.
72
Protein interaction databases

BIND - Biomolecular Interaction Network Database
DIP - Database of Interacting Proteins
PIM Hybrigenics
PathCalling Yeast Interaction Database
MINT - a Molecular Interactions Database
GRID - The General Repository for Interaction
Datasets
InterPreTS - protein interaction prediction
through tertiary structure
STRING - predicted functional associations among
genes/proteins
Mammalian protein-protein interaction database
(PPI)
InterDom - database of putative interacting
protein domains
FusionDB - database of bacterial and archaeal
gene fusion events
IntAct Project
The Human Protein Interaction Database (HPID)
ADVICE - Automated Detection and Validation of
Interaction by Co-evolution
InterWeaver - protein interaction reports with
online evidence
PathBLAST - alignment of protein interaction
networks
ClusPro - a fully automated algorithm for
protein-protein docking
HPRD - Human Protein Reference Database

73
Protein interaction database
74
Network of protein interactions and predicted
functional links involving silencing information
regulator (SIR) proteins. Filled circles
represent proteins of known function; open
circles represent proteins of unknown function,
represented only by their Saccharomyces genome
sequence numbers (http://genome-www.stanford.edu/Saccharomyces).
Solid lines show experimentally
determined interactions, as summarized in the
Database of Interacting Proteins (ref. 19;
http://dip.doe-mbi.ucla.edu). Dashed lines show
functional links predicted by the Rosetta Stone
method (ref. 12). Dotted lines show functional links
predicted by phylogenetic profiles (ref. 16). Some
predicted links are omitted for clarity.
75
Network of predicted functional linkages
involving the yeast prion protein Sup35 (ref. 20). The
dashed line shows the only experimentally
determined interaction. The other functional
links were calculated from genome and expression
data (ref. 11) by a combination of methods, including
phylogenetic profiles, Rosetta Stone linkages and
mRNA expression. Linkages predicted by more than
one method, and hence particularly reliable, are
shown by heavy lines. Adapted from ref. 11.
76
STRING - predicted functional associations among
genes/proteins

STRING is a database of predicted functional
associations among genes/proteins.
Genes of similar function tend to be maintained
in close neighborhood, tend to be present or
absent together, i.e. to have the same
phylogenetic occurrence, and can sometimes be
found fused into a single gene encoding a
combined polypeptide.
STRING integrates this information from as many
genomes as possible to predict functional links
between proteins.

Berend Snel (UU), Martijn Huynen (RUN) and the
group of Peer Bork (EMBL, Heidelberg)
77
STRING - predicted functional associations among
genes/proteins

STRING is a database of known and predicted
protein-protein interactions. The interactions
include direct (physical) and indirect
(functional) associations; they are derived from
four sources:
Genomic Context (Synteny)
High-throughput Experiments
(Conserved) Co-expression
Previous Knowledge
STRING quantitatively integrates interaction
data from these sources for a large number of
organisms, and transfers information between
these organisms where applicable. The database
currently contains 736429 proteins in 179 species

78
STRING - predicted functional associations among
genes/proteins
Conserved Neighborhood: this view shows
runs of genes that occur repeatedly in close
neighborhood in (prokaryotic) genomes. Genes
located together in a run are linked with a black
line (maximum allowed intergenic distance is 300
bp). Note that if there are multiple runs for a
given species, these are separated by white
space. If there are other genes in the run that
are below the current score threshold, they are
drawn as small white triangles. Gene fusion
occurrences are also drawn, but only if they are
present in a run.
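The 300 bp criterion for a run can be illustrated with a short Python sketch. It covers only the grouping of neighbouring genes within one genome; it is not STRING's actual pipeline, which also compares runs across genomes and scores them. The gene names and coordinates are invented.

```python
def gene_runs(genes, max_gap=300):
    """Group genes into runs whose intergenic distances are <= max_gap bp.

    genes : list of (name, start, end) tuples from one genome
    """
    runs = []
    for name, start, end in sorted(genes, key=lambda g: g[1]):
        previous_end = runs[-1][-1][2] if runs else None
        if previous_end is not None and start - previous_end <= max_gap:
            runs[-1].append((name, start, end))  # close enough: extend the run
        else:
            runs.append([(name, start, end)])    # too far away: start a new run
    return [[gene[0] for gene in run] for run in runs]


# Invented coordinates: a and b are 100 bp apart, b and c 700 bp, c and d 200 bp.
genes = [("a", 100, 900), ("b", 1000, 1800), ("c", 2500, 3000), ("d", 3200, 4000)]
print(gene_runs(genes))
```

The 700 bp gap between b and c exceeds the threshold, so the four genes fall into two separate runs.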
79
Wrapping up

Understand the chymotrypsin example: evolution via
gene duplication of an optimised two-domain
barrel enzyme with active site residues from
either domain.
Understand domain issues: structural and
functional
Understand the basic steps of the SnapDRAGON
method for domain boundary prediction, but no
need to memorize it all
Understand phylogenetic profiling and the Rosetta
Stone method (guilt-by-association)
Understand that conservation patterns in the
order of genes that are nearby on the genome
(synteny) indicate functional relationships (used
in STRING method)
Also co-expression (genes being expressed (or
not) at the same time) indicates a functional
relationship (used in STRING method)
