V8: Reliability of Protein Interaction Networks

Slides: 76
Provided by: volkhar
1
V8: Reliability of Protein Interaction Networks
One would like to integrate evidence from many different sources to better
discriminate between true and false protein-protein interaction predictions.
→ Use a Bayesian approach for integrating interaction information that allows
for the probabilistic combination of multiple data sets; apply it to yeast.
Input: The approach can be used for combining noisy genomic interaction data sets.
Normalization: Each source of evidence for interactions is compared against
samples of known positives and negatives (gold-standard).
Output: Predict for every possible protein pair the likelihood of interaction.
Verification: Test on experimental interaction data not included in the
gold-standard: new TAP (tandem affinity purification) experiments.
Jansen et al. Science 302, 449 (2003)
2
Integration of various information sources
3 different types of data are used:
(i) Interaction data from high-throughput experiments. These comprise
large-scale two-hybrid screens (Y2H) and in vivo pull-down experiments.
(ii) Other genomic features: expression data, biological function of proteins
(from the Gene Ontology biological process and the MIPS functional catalog),
and data about whether proteins are essential.
(iii) Gold-standards of known interacting and noninteracting protein pairs.
Jansen et al. Science 302, 449 (2003)
3
Combination of data sets into probabilistic interactomes
The 4 interaction data sets from high-throughput (HT) experiments were combined
into 1 PIE (probabilistic interactome, experimental). The PIE represents a
transformation of the individual binary-valued interaction sets into a data set
where every protein pair is weighted according to the likelihood that it exists
in a complex. Because the 4 experimental interaction data sets contain
correlated evidence, a fully connected Bayesian network is used.
The genomic features are combined into the PIP (probabilistic interactome,
predicted). These information sets hardly overlap, so a naïve Bayesian network
is used to model the PIP data.
Jansen et al. Science 302, 449 (2003)
4
Bayesian Networks
Bayesian networks are probabilistic models that graphically encode
probabilistic dependencies between random variables.
[Figure: node Y with directed arcs to the nodes E1, E2 and E3]
A directed arc from variable Y to E1 denotes conditional dependency of E1 on Y,
as determined by the direction of the arc.
Bayesian networks also include a quantitative measure of dependency. For each
variable and its parents this measure is defined using a conditional
probability function or a table. Here, one such measure is the probability
$\Pr(E_1 \mid Y)$.
5
Bayesian Networks
Together, the graphical structure and the conditional probability
functions/tables completely specify a Bayesian network probabilistic model.
[Figure: node Y with directed arcs to the nodes E1, E2 and E3]
This model, in turn, specifies a particular factorization of the joint
probability distribution function over the variables in the network.
Here,
$$\Pr(Y, E_1, E_2, E_3) = \Pr(E_1 \mid Y)\,\Pr(E_2 \mid Y)\,\Pr(E_3 \mid Y)\,\Pr(Y)$$
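The factorization on this slide can be illustrated with a small sketch. The probabilities below are invented purely for illustration (they are not from Jansen et al.), and the names `p_y`, `p_e` and `joint` are hypothetical; the point is only that the joint distribution is the product of the prior on Y and one conditional table per evidence node.

```python
from itertools import product

# Illustrative, made-up numbers. Y = "pair is in the same complex",
# E1..E3 = three binary sources of evidence.
p_y = {True: 0.01, False: 0.99}

# p_e[k][y] = Pr(E_k = True | Y = y)
p_e = [
    {True: 0.60, False: 0.05},
    {True: 0.30, False: 0.10},
    {True: 0.80, False: 0.40},
]

def joint(y, e):
    """Pr(Y=y, E1=e[0], E2=e[1], E3=e[2]) = Pr(Y) * prod_k Pr(E_k | Y)."""
    prob = p_y[y]
    for cpt, observed in zip(p_e, e):
        p_true = cpt[y]
        prob *= p_true if observed else 1.0 - p_true
    return prob

# A valid factorization sums to 1 over all 2^4 assignments.
total = sum(
    joint(y, e)
    for y in (True, False)
    for e in product((True, False), repeat=3)
)
```

Because each evidence node has Y as its only parent, four small tables are enough to specify all 16 joint probabilities.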
6
Gold-Standard
The gold-standard should be
(i) independent from the data sources serving as evidence,
(ii) sufficiently large for reliable statistics,
(iii) free of systematic bias (e.g. towards certain types of interactions).
Positives: use the MIPS (Munich Information Center for Protein Sequences,
HW Mewes) complexes catalog, a hand-curated list of complexes (8250 protein
pairs that are within the same complex) from the biomedical literature.
Negatives: harder to define, but essential for successful training.
Assume that proteins in different compartments do not interact. Synthesize
negatives from lists of proteins in separate subcellular compartments.
Jansen et al. Science 302, 449 (2003)
7
Measure of reliability: likelihood ratio
Consider a genomic feature f expressed in binary terms (i.e. absent or
present). The likelihood ratio L(f) is defined as
$$L(f) = \frac{\text{fraction of gold-standard positives having feature } f}{\text{fraction of gold-standard negatives having feature } f}$$
L(f) = 1 means that the feature has no predictive power: the same fraction of
positives and negatives have feature f. The larger L(f), the better its
predictive power.
Jansen et al. Science 302, 449 (2003)
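As a minimal sketch of the likelihood ratio for a binary feature, the function below divides the fraction of gold-standard positives carrying the feature by the corresponding fraction of negatives. The function name and the counts are hypothetical and chosen only to show one uninformative and one informative case.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L(f) = (fraction of positives with f) / (fraction of negatives with f)."""
    frac_pos = pos_with_f / pos_total
    frac_neg = neg_with_f / neg_total
    if frac_neg == 0:
        raise ValueError("no negatives carry the feature; L(f) is undefined")
    return frac_pos / frac_neg

# Hypothetical counts, for illustration only.
L_uninformative = likelihood_ratio(50, 1000, 500, 10000)   # 5% vs. 5%
L_informative = likelihood_ratio(200, 1000, 100, 10000)    # 20% vs. 1%
```

The first case gives a ratio of 1 (no predictive value); in the second the feature is about twenty times more frequent among positives than among negatives.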
8
Combination of features
For two features f1 and f2 with uncorrelated evidence, the likelihood ratio of
the combined evidence is simply the product:
$$L(f_1, f_2) = L(f_1) \cdot L(f_2)$$
For correlated evidence L(f1,f2) cannot be factorized in this way.
Bayesian networks are a formal representation of such relationships between
features. The combined likelihood ratio is proportional to the estimated odds
that two proteins are in the same complex, given multiple sources of
information.
Jansen et al. Science 302, 449 (2003)
9
Prior and posterior odds
Positive: a pair of proteins that are in the same complex. Given the number of
positives among the total number of protein pairs, the prior odds of finding a
positive are
$$O_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)}$$
Posterior odds: odds of finding a positive after considering N datasets with
values f1 ... fN:
$$O_{post} = \frac{P(pos \mid f_1 \ldots f_N)}{P(neg \mid f_1 \ldots f_N)}$$
The terms prior and posterior refer to the situation before and after knowing
the information in the N datasets.
Jansen et al. Science 302, 449 (2003)
10
Static naive Bayesian Networks
In the case of protein-protein interaction data,
the posterior odds describe the odds of having a
protein-protein interaction given that we have
the information from the N experiments, whereas
the prior odds are related to the chance of
randomly finding a protein-protein interaction
when no experimental data is known. If $O_{post} > 1$,
the chances of having an interaction are
higher than having no interaction.
Jansen et al. Science 302, 449 (2003)
11
Static naive Bayesian Networks
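The likelihood ratio L, defined as
$$L(f_1 \ldots f_N) = \frac{P(f_1 \ldots f_N \mid pos)}{P(f_1 \ldots f_N \mid neg)}$$
relates prior and posterior odds according to Bayes' rule:
$$O_{post} = L(f_1 \ldots f_N)\, O_{prior}$$
In the special case that the N features are conditionally independent (i.e.
they provide uncorrelated evidence) the Bayesian network is a so-called naïve
network, and L can be simplified to
$$L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \frac{P(f_i \mid pos)}{P(f_i \mid neg)}$$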
Jansen et al. Science 302, 449 (2003)
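The odds form of Bayes' rule for a naïve network can be sketched in a few lines. The helper names and the three per-feature ratios are hypothetical; the prior odds of 1/600 are an assumed round value for illustration. Under the independence assumption the combined likelihood ratio is just the product of the individual ratios.

```python
import math

def combined_likelihood_ratio(ratios):
    """Naive-Bayes combination: product of per-feature likelihood ratios."""
    return math.prod(ratios)

def posterior_odds(prior_odds, ratios):
    """Bayes' rule in odds form: O_post = L * O_prior."""
    return combined_likelihood_ratio(ratios) * prior_odds

def odds_to_probability(odds):
    """Convert odds o to the probability o / (1 + o)."""
    return odds / (1.0 + odds)

# Hypothetical per-feature likelihood ratios for one protein pair.
ratios = [12.0, 8.0, 10.0]
o_post = posterior_odds(1 / 600, ratios)
p_post = odds_to_probability(o_post)
```

No single ratio here comes close to 600, yet their product (960) pushes the posterior odds above 1, which is the effect the slides describe for weak features combined.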
12
Computation of prior and posterior odds
L can be computed from contingency tables relating positive and negative
examples with the N features (by binning the feature values f1 ... fN into
discrete intervals).
Determining the prior odds $O_{prior}$ is somewhat arbitrary. It requires an
assumption about the number of positives. Here, 30,000 is taken as a
conservative lower bound for the number of positives (i.e. pairs of proteins
that are in the same complex). Considering that there are ca. 18 million
= 0.5 N (N - 1) possible protein pairs in total (with N ≈ 6000 for yeast),
$O_{prior}$ ≈ 1/600, so $O_{post} > 1$ can be achieved with L > 600.
Jansen et al. Science 302, 449 (2003)
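The arithmetic behind the cutoff of about 600 can be checked directly. The sketch below uses only the numbers quoted on the slide (6000 proteins, 30,000 assumed positives); the variable names are illustrative.

```python
n_proteins = 6000            # approximate yeast proteome size from the slide
assumed_positives = 30_000   # the slide's conservative lower bound

# Number of unordered protein pairs: N (N - 1) / 2
n_pairs = n_proteins * (n_proteins - 1) // 2

# Prior odds = positives / negatives
prior_odds = assumed_positives / (n_pairs - assumed_positives)

# Smallest likelihood ratio for which O_post = L * O_prior reaches 1
l_threshold = 1 / prior_odds
```

With these inputs the number of pairs is just under 18 million and the threshold comes out at roughly 600, matching the slide.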
13
Essentiality (PIP)
Consider whether proteins are essential or non-essential: does a deletion
mutant where this protein is knocked out from the genome have the same
phenotype?
It should be more likely that the 2 proteins in a complex are both essential or
both non-essential than a mixture of these two attributes, because a deletion
mutant of either protein should impair the function of the same complex.
Jansen et al. Science 302, 449 (2003)
14
Parameters of the naïve Bayesian Networks (PIP)
Column 1 describes the genomic feature. In the essentiality data protein pairs
can take on 3 discrete values (EE: both essential; NN: both non-essential;
NE: one essential and one not).
Column 2 gives the number of protein pairs with a particular feature value
(e.g. EE) drawn from the whole yeast interactome (18M pairs).
Columns pos and neg give the overlap of these pairs with the 8,250
gold-standard positives and the 2,708,746 gold-standard negatives.
Columns sum(pos) and sum(neg) show how many gold-standard positives (negatives)
are among the protein pairs with likelihood ratio ≥ L, computed by summing up
the values in the pos (or neg) column.
P(feature value | pos) and P(feature value | neg) give the conditional
probabilities of the feature values, and L is the ratio of these two
conditional probabilities.
Jansen et al. Science 302, 449 (2003)
15
mRNA expression data
Proteins in the same complex tend to have correlated expression profiles.
Although large differences can exist between mRNA and protein abundance,
protein abundance can be indirectly and quite crudely measured by the presence
or absence of the corresponding mRNA transcript.
Experimental data sources:
- time course of expression fluctuations during the yeast cell cycle
- Rosetta compendium: expression profiles of 300 deletion mutants and cells
under chemical treatments.
Problem: both data sets are strongly correlated. Compute the first principal
component of the vector of the 2 correlations and use this as an independent
source of evidence for the P-P interaction prediction. The first principal
component is a stronger predictor of P-P interactions than either of the 2
expression correlation datasets by themselves.
Jansen et al. Science 302, 449 (2003)
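Reducing two correlated scores per protein pair to their first principal component can be sketched without external libraries, since the covariance matrix is only 2x2. This is a generic PCA under stated assumptions, not necessarily the exact procedure or scaling used by Jansen et al.; the function name and the input values are hypothetical.

```python
import math

def first_principal_component(xs, ys):
    """Return the unit direction of largest variance and the centered
    projections of the paired values (xs[i], ys[i]) onto it."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    a = sum(d * d for d in dx) / (n - 1)               # var(x)
    c = sum(d * d for d in dy) / (n - 1)               # var(y)
    b = sum(p * q for p, q in zip(dx, dy)) / (n - 1)   # cov(x, y)
    # Largest eigenvalue of the symmetric matrix [[a, b], [b, c]]
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    if abs(b) > 1e-12:
        vx, vy = b, lam - a          # eigenvector for eigenvalue lam
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    scores = [vx * p + vy * q for p, q in zip(dx, dy)]
    return (vx, vy), scores

# Hypothetical expression correlations of four protein pairs in two data sets.
corr_cell_cycle = [0.1, 0.5, 0.9, -0.3]
corr_compendium = [0.2, 0.6, 1.0, -0.2]
direction, scores = first_principal_component(corr_cell_cycle, corr_compendium)
```

The sign of a principal component is arbitrary, and the scores are centered projections, so any mapping onto a fixed range such as -1.0 to 1.0 would be an additional step not shown here.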
16
mRNA expression data
The values for mRNA expression correlation (first
principal component) range on a continuous scale
from -1.0 to 1.0 (fully anticorrelated to fully
correlated). This range was binned into 19
intervals.
Jansen et al. Science 302, 449 (2003)
17
PIP: Functional similarity
Quantify functional similarity between two proteins:
- Consider which set of functional classes two proteins share, given either the
MIPS or Gene Ontology (GO) classification system.
- Then count how many of the 18 million protein pairs in yeast share the exact
same functional classes as well (yielding integer counts between 1 and 18
million). This count was binned into 5 intervals.
- In general, the smaller this count, the more similar and specific is the
functional description of the two proteins.
Jansen et al. Science 302, 449 (2003)
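The counting scheme can be sketched on a toy annotation table. One interpretive assumption is made here and should be read as such: a pair "shares the exact same functional classes" if both of its proteins carry every class in the set shared by the query pair. The protein names, class names and function name are hypothetical.

```python
def functional_similarity_count(a, b, classes):
    """Count the protein pairs in which both members carry every functional
    class shared by proteins a and b. Returns None if a and b share none."""
    shared = classes[a] & classes[b]
    if not shared:
        return None
    carriers = [p for p, cls in classes.items() if shared <= cls]
    m = len(carriers)
    return m * (m - 1) // 2   # number of unordered pairs among the carriers

# Toy annotation table (hypothetical names).
classes = {
    "P1": {"transcription", "rna_pol_II"},
    "P2": {"transcription", "rna_pol_II"},
    "P3": {"transcription"},
    "P4": {"transcription"},
    "P5": {"transport"},
}

specific = functional_similarity_count("P1", "P2", classes)   # narrow shared set
generic = functional_similarity_count("P3", "P4", classes)    # broad shared set
```

The pair sharing the narrower class set receives the smaller count, in line with the statement that small counts indicate a more specific common functional description.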
18
PIP: Functional similarity
Observation: low counts correlate with a higher chance of two proteins being in
the same complex. But the signal (L) is quite weak.
Jansen et al. Science 302, 449 (2003)
19
Calculation of the fully connected Bayesian
network (PIE)
The 4 binary experimental interaction datasets can be combined in at most
2^4 = 16 different ways (subsets). For each of these 16 subsets, one can
compute a likelihood ratio from the overlap with the gold-standard positives
(pos) and negatives (neg).
Jansen et al. Science 302, 449 (2003)
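For the fully connected case the likelihood ratio is estimated per joint presence/absence pattern rather than per data set. A minimal sketch, assuming each gold-standard pair is represented by a 4-tuple of 0/1 flags (one per experimental data set); the function name and the toy counts are hypothetical.

```python
from collections import Counter

def joint_likelihood_ratios(positives, negatives):
    """positives / negatives: one tuple of 0/1 flags per gold-standard pair,
    e.g. (1, 0, 0, 1). Returns {pattern: L} for every pattern observed in
    both gold-standard sets."""
    pos_counts = Counter(positives)
    neg_counts = Counter(negatives)
    ratios = {}
    for pattern, n_pos in pos_counts.items():
        n_neg = neg_counts.get(pattern, 0)
        if n_neg == 0:
            continue  # undefined without negatives; a real analysis needs a policy here
        ratios[pattern] = (n_pos / len(positives)) / (n_neg / len(negatives))
    return ratios

# Toy gold standards (made-up counts).
positives = [(1, 1, 0, 0)] * 3 + [(0, 0, 0, 0)] * 7
negatives = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 99
ratios = joint_likelihood_ratios(positives, negatives)
```

Because each of the up to 16 patterns gets its own ratio, no independence between the data sets has to be assumed, at the price of needing enough gold-standard pairs per pattern.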
20
Distribution of likelihood ratios
Number of protein pairs in the individual datasets and the probabilistic
interactomes as a function of the likelihood ratio.
There are many more protein pairs with high likelihood ratios in the
probabilistic interactomes (PIE) than in the individual datasets G, H, U, I
(Gavin, Ho, Uetz, Ito). Protein pairs with high likelihood ratios provide leads
for further experimental investigation of proteins that potentially form
complexes.
Jansen et al. Science 302, 449 (2003)
21
PIP vs. the information sources
Ratio of true to false positives (TP/FP) increases monotonically with Lcut.
→ L is an appropriate measure of the odds of a real interaction.
The ratio is computed as
$$\frac{TP(L_{cut})}{FP(L_{cut})} = \frac{\sum_{L \ge L_{cut}} pos(L)}{\sum_{L \ge L_{cut}} neg(L)}$$
Protein pairs with Lcut > 600 have a > 50% chance of being in the same complex.
Jansen et al. Science 302, 449 (2003)
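The TP/FP ratio at a cutoff can be sketched as a pair of cumulative sums over table rows. The row format `(L, pos, neg)` and all numbers are hypothetical; pos and neg stand for overlaps with gold-standard positives and negatives.

```python
def tp_fp_ratio(bins, l_cut):
    """bins: iterable of (L, pos, neg) rows. Returns the ratio of summed
    positives to summed negatives over all rows with L >= l_cut, or None
    if no negatives lie above the cutoff."""
    tp = sum(pos for L, pos, neg in bins if L >= l_cut)
    fp = sum(neg for L, pos, neg in bins if L >= l_cut)
    if fp == 0:
        return None  # ratio undefined
    return tp / fp

# Hypothetical rows, for illustration only.
bins = [(1000.0, 40, 10), (700.0, 30, 20), (50.0, 100, 400), (0.5, 200, 5000)]

ratio_strict = tp_fp_ratio(bins, 600)   # only the two highest-L rows
ratio_loose = tp_fp_ratio(bins, 10)     # three rows
ratio_all = tp_fp_ratio(bins, 0)        # every row
```

In this toy table the ratio rises as the cutoff is raised, which is the monotonic behaviour the slide reports for the real data.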
22
PIE vs. the information sources
9897 interactions are predicted from PIP and 163 from PIE. In contrast,
likelihood ratios derived from single genomic factors (e.g. mRNA coexpression)
or from individual interaction experiments (e.g. the Ho data set) did not
exceed the cutoff when used alone. This demonstrates that information sources
that, taken alone, are only weak predictors of interactions can yield reliable
predictions when combined.
Jansen et al. Science 302, 449 (2003)
23
parts of PIP graph
To test whether the thresholded PIP is biased toward certain complexes, compare
the distribution of predictions among gold-standard positives.
(A) The complete set of gold-standard positives and their overlap with the PIP.
The PIP (green) covers 27% of the gold-standard positives (yellow). The
predictions are roughly equally apportioned among the different complexes
→ no bias.
Jansen et al. Science 302, 449 (2003)
24
parts of PIP graph
Graph of the largest complexes in PIP, i.e. only those proteins having ≥ 20
links.
(Left) Overlapping gold-standard positives are shown in green, PIE links in
blue, and overlaps with both PIE and gold-standard positives in black.
(Right) Overlapping gold-standard negatives are shown in red. Regions with many
red links indicate potential false-positive predictions.
Jansen et al. Science 302, 449 (2003)
25
experimental verification
Conduct TAP-tagging experiments (Cellzome) for 98 proteins. These produced 424
experimental interactions overlapping with the PIP thresholded at Lcut = 300.
Of these, 185 overlapped with gold-standard positives and 16 with negatives.
Jansen et al. Science 302, 449 (2003)
26
Concentrate on large complexes
So far all interactions were treated as independent. However, the joint
distribution of interactions in the PIs can help identify large complexes: an
ideal complex should be a fully connected clique in an interaction graph.
In practice, this rarely happens because of incorrect or missing links. Yet
large complexes tend to have many interconnections among their members, whereas
false-positive links to outside proteins tend to occur randomly, without a
coherent pattern.
Jansen et al. Science 302, 449 (2003)
27
Improve ratio TP / FP
TP/FP for subsets of the thresholded PIP that
only include proteins with a minimum number of
links. Requiring a minimum number of links
isolates large complexes in the thresholded PIP
graph (Fig. 3B).
Observation: Increasing the minimum number of links raises TP/FP by preserving
the interactions among proteins in large complexes, while filtering out
false-positive interactions with heterogeneous groups of proteins outside the
complexes.
Jansen et al. Science 302, 449 (2003)
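The minimum-links filter can be sketched as a single pass over the edge list of the thresholded graph. This sketch assumes that link counts are taken once in the unfiltered graph (not recomputed iteratively after removing proteins); the function name and the toy graph are hypothetical.

```python
from collections import defaultdict

def filter_by_min_links(edges, min_links):
    """Keep only edges whose two proteins each have at least min_links
    links in the original (unfiltered) graph."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    keep = {p for p, d in degree.items() if d >= min_links}
    return [(a, b) for a, b in edges if a in keep and b in keep]

# Toy graph: a triangle A-B-C (a small "complex") plus one stray link C-X.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "X")]
core_edges = filter_by_min_links(edges, min_links=2)
```

The stray link to X is dropped while the densely connected triangle survives, which mirrors how the filter favours links inside large complexes.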
28
Summary
Bayesian approach allows reliable predictions of
protein-protein interactions by combining weakly
predictive genomic features.
The de novo prediction of complexes replicated
interactions found in the gold-standard positives
and PIE. Also, several predictions were confirmed
by new TAP experiments.
The accuracy of the PIP was comparable to that of
the PIE while simultaneously achieving greater
coverage.
In a similar manner, the approach could have been
extended to a number of other features related to
interactions (e.g. phylogenetic co-occurrence,
gene fusions, gene neighborhood).
As a word of caution: Bayesian approaches don't
work everywhere.
Jansen et al. Science 302, 449 (2003)
29
Dynamic Simulation of Protein Complex Formation
  • Most cellular functions are conducted or
    regulated by protein complexes of varying size.
  • Organization into complexes may contribute
    substantially to an organism's complexity.
  • E.g. 6000 different proteins (yeast) may form
    18 × 10^6 different pairs of interacting proteins,
    but already 10^11 different complexes of size 3.
  • → A mechanism by which evolution could significantly
    increase the regulatory and metabolic complexity
    of organisms without substantially increasing the
    genome size.
  • Only a very small subset of the many possible
    complexes is actually realized.

Beyer, Wilhelm, Bioinformatics
30
Experimental reference data
229 biologically meaningful TAP complexes from
yeast with sizes ranging from 2 to 88 different
proteins per complex.

Cumulative means that there are 229 complexes of size ≥ 2, which may also be
parts of larger complexes.
→ The size-frequency distribution of complexes has a common characteristic: the
number of complexes of a given size versus complex size is exponentially
decreasing.
Does the shape of this distribution reflect the nature of the underlying
cellular dynamics which is creating the protein complexes?
→ Test by simulation model.
31
Dynamic Complex Formation Model
3 variants of the protein complex association-dissociation model (PAD-model)
are tested with the following features:
(i) In all 3 versions the composition of the proteome does not change with
time. Degradation of proteins is always balanced by an equal production of the
same kind of proteins.
(ii) The cell consists of either one (PAD A and B) or several (PAD C)
compartments in which proteins and protein complexes can freely interact with
each other. Thus, all proteins can potentially bind to all other proteins in
their compartment (risky assumption!).
(iii) Association and dissociation rate constants are the same for all
proteins. In PAD-models A and C association and dissociation are independent of
complex size and complex structure.

32
Dynamic Complex Formation Model
(iv) At each time step a set of complexes is randomly selected to undergo
association and dissociation. Association is simulated as the creation of new
complexes by the binding of two smaller complexes. Dissociation is simulated as
the reverse process, i.e. the decay of a complex into two smaller complexes.
The numbers of associations and dissociations per time step are $k_a N_C^2$ and
$k_d N_C$, respectively, where
$N_C$ = total number of complexes in the cell,
$k_a$ [1/(complexes · time)] = association rate constant,
$k_d$ [1/time] = dissociation rate constant.
$k_a$ and $k_d$ correspond to the biochemical rates of a reversible reaction.
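One time step of a size-independent scheme of this kind can be sketched as follows. This is an illustrative toy, not the authors' implementation: it assumes the event counts are rounded to integers, that associations are executed before dissociations within a step, that partners are drawn uniformly at random, and that every way of splitting a complex into two non-empty parts is equally likely. The function name `pad_step` and all parameter values are hypothetical.

```python
import random

def pad_step(complexes, ka, kd, rng):
    """Apply one time step in place. complexes is a list of lists of protein ids."""
    n_c = len(complexes)
    n_assoc = round(ka * n_c ** 2)    # number of associations this step
    n_dissoc = round(kd * n_c)        # number of dissociation attempts this step

    for _ in range(n_assoc):
        if len(complexes) < 2:
            break
        i, j = rng.sample(range(len(complexes)), 2)
        merged = complexes[i] + complexes[j]
        for idx in sorted((i, j), reverse=True):   # pop higher index first
            complexes.pop(idx)
        complexes.append(merged)

    for _ in range(n_dissoc):
        idx = rng.randrange(len(complexes))
        members = complexes[idx]
        if len(members) < 2:
            continue                  # a single protein cannot dissociate
        while True:                   # draw a uniformly random non-trivial split
            side = [rng.random() < 0.5 for _ in members]
            if any(side) and not all(side):
                break
        left = [m for m, s in zip(members, side) if s]
        right = [m for m, s in zip(members, side) if not s]
        complexes.pop(idx)
        complexes.extend([left, right])
    return complexes

# Small toy run: 200 free proteins, 50 time steps, arbitrary rate constants.
rng = random.Random(0)
complexes = [[p] for p in range(200)]
for _ in range(50):
    pad_step(complexes, ka=1e-4, kd=0.05, rng=rng)
```

Proteins are only regrouped, never created or destroyed, which corresponds to the model feature that the composition of the proteome does not change with time.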

33
Protein Association/Dissociation Models
PAD A: the simplest model, where all proteins can interact with each other (no
partitioning); it assumes that association and dissociation are independent of
complex size.
PAD B: equivalent to PAD A, but larger complexes are assumed more likely to
bind (preferential attachment). Here, the binding probability is assumed to be
proportional to i · j, where i and j are the sizes of two potentially
interacting complexes.
PAD C: extends PAD A by assuming that proteins can interact only within groups
of proteins (with partitioning). The sizes of these protein groups are based on
the sizes of first level functional modules according to the yeast data base.
PAD C assumes 16 modules each containing between 100 and 1000 different ORFs.
→ The protein groups do not represent physical compartments, but rather
resemble functional modules of interacting proteins.

34
Mathematical Description
- Explicit simulation of an entire cell (50 million protein molecules were
simulated) is too time consuming for many applications of the model.
- Therefore use a simplified mathematical description of the PAD model to
quickly assess different scenarios and parameter combinations.
The change of the number of complexes of size i, $\Delta x_i$, during one time
step $\Delta t$ can be described as
$$\frac{\Delta x_i}{\Delta t} = G_i^a + G_i^d - L_i^a - L_i^d \qquad (1)$$
$G_i^a$ and $G_i^d$: gains due to association and dissociation.
$L_i^a$ and $L_i^d$: losses due to association and dissociation.
35
Mathematical Description
Given a total number of $N_C$ complexes, the total numbers of associations and
dissociations per time step are $k_a N_C^2$ and $k_d N_C$, respectively.
We assume throughout that we can calculate the mean numbers of associating and
dissociating complexes of size i per time step as $2 k_a x_i N_C$ and
$k_d x_i$, respectively.
The probability that complexes of size j and i-j get selected for one
association is
[equation missing in transcript]
→ Deduce the number of complexes of size i that get created during each time
step via association of smaller complexes simply by summing over all complex
sizes that potentially create a complex of size i:
[equation missing in transcript]
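The association terms can be sketched from the totals stated on this slide. What follows is a reconstruction under the assumption that the two partners of each association are drawn independently and uniformly from the pool of $N_C$ complexes; it ignores any correction for drawing both partners from the same size class, and it is not necessarily the notation of the original slides.

```latex
% Assumption: ordered, independent, uniform selection of both partners.
P(j,\, i-j) \approx \frac{x_j}{N_C} \cdot \frac{x_{i-j}}{N_C}

% Gain of size-i complexes: total associations times the summed probability.
G_i^a \approx k_a N_C^2 \sum_{j=1}^{i-1} \frac{x_j \, x_{i-j}}{N_C^2}
      = k_a \sum_{j=1}^{i-1} x_j \, x_{i-j}

% Consistency checks against the stated totals (using \sum_i x_i = N_C):
\sum_{i \ge 2} G_i^a = k_a \sum_{j \ge 1} \sum_{k \ge 1} x_j x_k = k_a N_C^2 ,
\qquad
\sum_{i \ge 1} 2 k_a x_i N_C = 2 k_a N_C^2
```

The second check reflects that each association removes two complexes, so the summed loss is twice the number of associations per time step.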

36
Mathematical Description
When j is equal to i/2 (which is possible only for even i) both interaction
partners have the same size. The size of the pool $x_{i-j}$ is therefore
reduced by 1 after the first interaction partner has been selected, which
yields a small reduction of the probability of selecting a second complex from
that pool. Account for this effect with a correction term, which only applies
to even i:
[equation missing in transcript]
This correction is usually very small.
The loss of complexes of size i due to association is simply proportional to
the probability of selecting them for association, i.e.
$$L_i^a = 2 k_a x_i N_C$$
37
Mathematical Description

Complexes of size i get created by dissociation of larger complexes. A complex
of size j has a certain number of possible ways of dissociation and a certain
number of possible fragments of size i [both expressions missing in
transcript]. From these follows the probability that a dissociating complex of
size j > i creates a fragment of size i [equation missing in transcript].
The number of new complexes follows by summing over all possible parent sizes
[equation missing in transcript].
The respective loss term becomes
$$L_i^d = k_d x_i$$
38
Number of complexes formed
The figure shows a comparison of a numerical
solution of equation (1) with a stochastic
simulation of the association-dissociation
process.

39
Steady-state
After a transient period a steady-state is reached. We are mainly interested in
this steady-state distribution of frequencies $x_i$.
→ Find a set of $x_i$ solving $\Delta x_i / \Delta t = 0$.
The solution of this non-linear equation system is obtained by numerically
minimizing all $\Delta x_i / \Delta t$.
By dividing equation (1) by $k_d$ it can be seen that the steady-state
distribution is independent of the absolute values of $k_a$ and $k_d$; it
depends only on the ratio of the two parameters, $R_{ad} = k_a / k_d$.
Hence, only two parameters affect the $x_i$ at steady-state:
- the total number of proteins $N_P$ (which indirectly determines $N_C$) and
- the ratio of the two rate constants $R_{ad}$.

40
Association in model C

For PAD-model B the dissociation terms remain unchanged, whereas the
association terms have to be modified. In case of PAD C we calculated weighted
averages of results obtained with PAD A.
Assume that association is proportional to the product of the sizes of the
participating complexes. This assumption changes equation (2) to
[equation missing in transcript]
where n is the maximum complex size [definition of a second term missing in
transcript].
41
Computation of a Dissociation Constant KD
Mathematically our model describes a reversible (bio-)chemical reaction.
→ Calculate an equilibrium dissociation constant KD, which quantifies the
fraction of free subcomplexes A and B compared to the bound complex AB.
This equilibrium is complex size dependent, because a large complex AB is less
likely to randomly dissociate exactly into the two specific subunits A and B
than a small complex. (A and B can be ensembles of several proteins.)
We get for any given complex of size i the following KD:
$$K_D(i) = \frac{[A][B]}{[AB]} = (R_{ad}\, N_i\, V)^{-1} \qquad (4)$$
where $N_i$ is the number of possible fragments of a complex of size i and V is
the cell volume.
Cell-wide averages of KD-values are estimated by computing a weighted average
[equation missing in transcript], with NC being the total number of complexes
and xi being the number of complexes of size i.

42
Results
- Dynamically simulate the association and dissociation of 6200 different
protein types yielding a set of about 50 million protein molecules.
- Analyze the resulting steady-state size distribution of protein complexes.
This steady-state is thought to reflect the growth conditions under which the
yeast cells were held when TAP-measuring the protein complexes.
- Calculate a protein complex size distribution from the experimental data to
which we can compare the simulation results (Figure 1).

43
Results
TAP measurements do not provide concentrations of the measured complexes; they
only demonstrate the presence of a certain protein complex in yeast cells. Also
the number of proteins of a certain type inside such a complex could not be
measured.
→ The complex size from Figure 1 does not represent real complex sizes (i.e.
total number of proteins in the complex), but refers to the number of different
proteins in a complex.
The measured data reflect the characteristics of only 229 different protein
complexes of size ≥ 2, which is just a small subset of the complexosome. These
peculiarities have to be taken into account when comparing simulation results
to the observed complex size distribution.
Here, the measurable complex size is taken as the number of distinct proteins
in a protein complex (Figure 2). When comparing our simulation results to the
measurements, we always select a random subset of 229 different complexes from
the simulated pool of complexes. This results in a complex size distribution
comparable to the measured distribution from Figure 1 (bait distribution).

44
Effect of preferential attachment
Both simulations were performed with the best fit
parameters for PAD A. In case of preferential
attachment the best regression result (solid
line) is obtained with a power-law, while the
simulation without preferential attachment is
best fitted assuming an exponentially decreasing
curve. The original, measurable and bait
distributions are always close to exponential in
case of PAD A and power-law like in case of PAD
B, independent of the parameters chosen.

Cumulative number of distinct protein complexes
versus their size, resulting from simulations
without (diamonds) and with (squares)
preferential attachment to larger complexes.
PAD B model gives a power-law distribution → not in
agreement with experimental observation.
45
Conclusions
A very simple, dynamic model can reproduce the observed complex size
distribution. Given the small number of input parameters the very good fit of
the observed data is astonishing (and may be fortuitous).
Preferential attachment does not take place in yeast cells under the
investigated conditions. This is biologically plausible: specific and strong
binding can be just as important for small protein complexes as for large
complexes. → The dissociation should on average be independent of the complex
size.
Interpreting the simulated association and dissociation in terms of KD-values
suggests that larger complexes bind more strongly than smaller complexes.
However, the size dependence of KD is compensated by the higher number of
possible dissociations in larger complexes. Here, we assumed that all possible
dissociations happen with the same probability. In reality large complexes may
break into specific subcomplexes, which subsequently can be re-used for a
different purpose.
→ Improved versions of the model should account for specificity of association
and for specific dissociation.

46
Conclusions
Conclusion 2: the number of complexes that were missed during the TAP
measurements is potentially large. Simulations give an upper limit of the
number of different complexes in cells.
At first glance, the number of different complexes in PAD A (> 3.5 million) and
PAD C (≈ 2 million) may appear to be far too large. Even PAD C may overestimate
the true number of different complexes, because association within the groups
is unrestricted. However, the PAD-models do not only simulate functional,
mature complexes; they also consider all intermediate steps. Each of these
steps is counted as a different protein complex.
The large difference between the number of measured complexes and the
(potential) number of existing complexes may partly explain the very small
overlap that has been observed between different large scale measurements of
protein complexes.
A correct interpretation of the kinetic parameters is important:
- ka and kd cannot be compared to real numbers, because the model does not
define a length of the time steps for interpreting ka and kd as actual rate
constants.
- the association-to-dissociation ratio Rad is not identical to a physical
KD-value obtained by in vitro measurements of protein binding in water
solutions.

47
Discussion
Factors complicating this simple interpretation:
(i) In vivo diffusion rates are below those in water (e.g. 5 to 20-fold) due to
the high concentration of proteins and other large molecules in the cytosol.
(ii) Most proteins either are synthesized where they are needed or they get
transported directly to the site where the complex gets assembled. → Transport
to the site of action is on average faster than random diffusion.
(iii) Protein concentrations are often above the cell average due to the
compartmentalization of the cell.
All these processes (protein production, transport, and degradation) are not
explicitly described in the PAD-model, but they are lumped in the assumptions.
Rad must therefore be interpreted as an operationally defined property. It
characterizes the overall, cell-averaged complex assembly process, which
includes all steps necessary to synthesize a protein complex.

48
Discussion
However, even the model-derived KD values allow for some conclusions regarding
complex formation. We calculated weighted averages of the size-dependent
KD-values by using the steady-state complex size distribution of the best fit.
This yields average KD values of 4.7 nM and 0.18 nM for the best fits of PAD A
and PAD C, respectively.
First, the fact that the KD for PAD C is below that of PAD A underlines the
notion that more specific binding is reflected by smaller KD values.
Second, typical in vitro KD values are > 1 nM. Thus the average KD of PAD C is
quite low. The model confirms that protein complex formation in vivo gets
accelerated due to directed protein transport and due to the
compartmentalization of eukaryotes.

49
Discussion
The simulated complex size distribution is almost
independent of the assumed protein abundance
distribution. PP is a valuable summarizing
property that can be used to characterize
proteomes of different species. A decreasing PP
increases the number of different large complexes
(the slope in Table 1 gets more shallow), because
it is less likely that a large complex contains
the same protein twice. Thus, PP is a measure
of complexity that not only relates to the
diversity of the proteome but also to the
composition of protein complexes. Probably the
most severe simplification in our model is the
assumption that all proteins can potentially
interact with each other. PAD-model C is a first
step towards more biological realism. By
restricting the number of potential interaction
partners it more closely maps functional modules
and cell compartments, which both restrict the
interaction among proteins.

50
Further improvements
The partitioning in PAD C means that proteins
within one group exhibit very strong binding,
whereas binding between protein groups is set to
zero. This again is a simplification, since
cross-talk between different modules or
compartments is possible. Future extensions of
the model could incorporate more and more
detailed information about the binding
specificity of proteins. Assuming even more
specific binding will further reduce the number
of different complexes, whereas the frequency of
the complexes will increase. High binding
specificity potentially lowers the complex sizes,
so Rad has to be increased in order to fit the
experimentally observed protein complex size
distribution. On the other hand, cross talk
gives rise to larger complexes. Taking both
counteracting refinements into account, it is
impossible to generally predict the best-fit Rad,
since it depends on the quantitative details.

51
Further improvements
- A refinement of PAD C could account for the observed clustering of protein
interaction networks.
- One could simulate protein associations and dissociations according to
predefined binary protein interactions.
- A detailed model could additionally account for individual
association/dissociation rates between individual proteins.
Such extensions will yield more realistic figures about the number of different
protein complexes created in yeast cells.

52
additional slides (not used)

53
Overview
PIP and PIE are separately tested against the
gold-standard.
Jansen et al. Science 302, 449 (2003)
54
Possible Limitations
In order to get a correct picture of the protein
complex size distribution it is necessary to have
an unbiased, random subset of all complexes in
the cells. TAP data are biased, e.g. contain
too few membrane proteins. However, if compared
to other data sets such as MIPS complexes, the
TAP complexes constitute a fairly random
selection of all protein complexes in
yeast. Uncertainties in the TAP data do not
affect our conclusions as long as they are not
strongly biased with respect to the resulting
complex size distribution. Since Gavin et al.
(2002) have measured long-term interactions, our
results apply to permanent complexes. Yet the
model is applicable to future protein complex
data that take account of transient binding.

55
Protein Abundance Data
Abundance of 6200 yeast proteins ....

Beyer et al. (2004) compiled a protein abundance
data set for yeast under standard conditions in
YPD-medium. Based on this data set we derived a
distribution of protein abundances that resembles
the characteristics of the measured data in the
upper range (Figure S2). For approximately 2000
proteins no abundance values are available. We
assume that the undetected proteins primarily
belong to the low-abundance classes, which gives
rise to the hypothetical distribution.
56
Biochemical Interpretation of the Rate Constants
The process of forming a protein complex AB from
the two subcomplexes A and B, and its
dissociation, can be described as a reversible
reaction
A + B ⇌ AB
with constants kon [L/(mol s)] and koff [1/s]
quantifying the forward and backward reactions.
In our model the concentration [A] can be
calculated from fA, the fraction of species A
among all NC complexes in the system, and V, the
cell volume.
57
Biochemical Interpretation of the Rate Constants
The number of associations of two complex-species
A and B per time step is proportional to the
product nA nB, since we assume ka NC^2 many
associations per time step. Here, nA and nB are
the numbers of complexes of the respective
species. Division by the cell volume V yields
units of concentration per time. Thus, kon in a
biochemical reaction approximately equals ka V,
since the total number of complexes NC is very
large in all scenarios that we have simulated.
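The equations on these two slides are not preserved in the transcript. The following is a reconstruction from the surrounding definitions, not the authors' exact notation. It assumes concentrations are counted in particles per volume (molar units would need an additional division by Avogadro's number), and the symbol N_assoc is introduced here for illustration only.

```latex
\begin{align}
  A + B \;&\underset{k_\mathrm{off}}{\overset{k_\mathrm{on}}{\rightleftharpoons}}\; AB \\
  [A] &= \frac{f_A N_C}{V}, \qquad f_A = \frac{n_A}{N_C} \\
  N_\mathrm{assoc}^{AB} &\approx k_a N_C^2 \, f_A f_B = k_a \, n_A n_B \\
  \frac{N_\mathrm{assoc}^{AB}}{V} &\approx k_a V \, \frac{n_A}{V} \, \frac{n_B}{V}
      = k_a V \, [A][B]
      \quad\Longrightarrow\quad k_\mathrm{on} \approx k_a V
\end{align}
```

The last line is the step behind the statement that kon approximately equals ka V: dividing the association count by V and rewriting the copy numbers as concentrations leaves ka V as the prefactor of [A][B].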
58
Biochemical Interpretation of the Rate Constants
When looking for an equivalent expression for
koff we have to quantify the specific
dissociation of a complex AB into the
subcomplexes A and B. The unspecific
dissociation of AB is simply kd [AB], where kd is
the dissociation rate constant. Since AB may
consist of more than 2 proteins, it can also be
split into subcomplexes other than A and B. For
the specific dissociation rate, one has to know
how often AB actually dissociates into the
subcomplexes A and B. The total number of
dissociations per time step is kd NC. The
probability that a complex AB with size i breaks
into the specific subcomplexes A and B is 1/Ni,
where Ni is the number of possible fragments of a
complex of size i. This holds under the
assumption that all proteins in AB are distinct,
which is approximately true for the simulations
conducted here.
59
Biochemical Interpretation of the Rate Constants

With nAB/NC being the fraction of complexes AB
among all complexes, the size-specific
dissociation rate is
N_dissoc^AB(i) = kd NC (nAB/NC) (1/Ni) = kd nAB / Ni
from which the complex-size-dependent rate
constant koff(i) = kd/Ni results. Taking into
account that certain proteins may be in the
complex more than once, we get koff ≥ kd/Ni. One
can calculate an apparent equilibrium constant
KD, which describes the equilibrium between the
independent species A and B and the bound species
AB:
KD = koff / kon ≈ kd / (ka V Ni)
where i is the size of the complex AB. Since Ni
is exponentially increasing with i, KD is
exponentially decreasing with complex size.
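A minimal numerical sketch of this size dependence follows. The slides do not give a formula for Ni, so the code assumes Ni = 2^(i-1) - 1, the number of ways to split i distinct proteins into two non-empty subcomplexes. The parameter values for kd, ka and V are purely hypothetical.

```python
def n_fragments(size: int) -> int:
    """ASSUMPTION: number of ways to split `size` distinct proteins
    into two non-empty subcomplexes (not stated in the slides)."""
    if size < 2:
        raise ValueError("a complex needs at least 2 proteins to dissociate")
    return 2 ** (size - 1) - 1


def k_off(size: int, k_d: float) -> float:
    """Size-specific dissociation rate constant koff(i) = kd / Ni."""
    return k_d / n_fragments(size)


def apparent_kd(size: int, k_d: float, k_a: float, volume: float) -> float:
    """Apparent equilibrium constant KD = koff / kon with kon ~ ka * V."""
    return k_off(size, k_d) / (k_a * volume)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the trend.
    for i in range(2, 8):
        print(i, n_fragments(i), apparent_kd(i, k_d=1.0, k_a=1e-3, volume=1.0))
```

Under this assumption Ni roughly doubles with each added protein, so the apparent KD roughly halves per unit of complex size.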
60
Measurable Size Distribution and Bait Selection
Based on the steady-state distribution resulting
from equation (1), we derive two further
distributions: (i) the measurable size
distribution and (ii) the bait distribution.
The former is defined as the frequency
distribution of the measurable complex sizes.

The measurable complex size is the number of
different proteins in a protein complex (as
opposed to the total number of proteins). For
the measurable size-distribution we only count
the number of complexes with distinct protein
compositions.
Measurable versus actual complex size
distribution. Diamonds show frequencies of actual
complex sizes and triangles are frequencies of
measurable complexes. Filled diamonds and
triangles reflect simulation without partitioning
(PAD A) and open diamonds and triangles are
simulation results assuming binding only within
certain modules (PAD C). The difference between
the original and the measurable complex size
distribution is comparatively small, because most
of the simulated complexes are unique. However, in
the case of PAD C smaller complexes occur at higher
copy numbers and larger complexes are often
counted as smaller measurable complexes because
they contain some proteins more than once.
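The distinction between actual and measurable size can be sketched in a few lines. This is only an illustration of the two definitions above, not the simulation code. It assumes a complex is represented as a list of protein names in which repeats are allowed, and the example complexes are made up.

```python
from collections import Counter


def measurable_size(complex_members):
    """Number of different proteins in a complex (repeats ignored)."""
    return len(set(complex_members))


def size_distributions(complexes):
    """Return (actual, measurable) size distributions as Counters.

    actual:     every complex copy counted, size = total number of proteins
    measurable: each distinct protein composition counted once,
                size = number of different proteins
    """
    actual = Counter(len(c) for c in complexes)
    distinct_compositions = {frozenset(c) for c in complexes}
    measurable = Counter(len(c) for c in distinct_compositions)
    return actual, measurable


if __name__ == "__main__":
    # Hypothetical complexes: AB occurs twice, and one complex holds protein A twice.
    complexes = [["A", "B"], ["A", "B"], ["A", "A", "C"], ["D"]]
    actual, measurable = size_distributions(complexes)
    print(dict(actual))
    print(dict(measurable))
```

In the example the complex A-A-C has actual size 3 but measurable size 2, which is the effect described for PAD C.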
61
Reliability of Protein Interaction Networks
Direct comparison of different data sets
62
High-throughput methods for detecting protein
interactions
Yeast two-hybrid assay. Pairs of proteins to be
tested for interaction are expressed as fusion
proteins ('hybrids') in yeast: one protein is
fused to a DNA-binding domain, the other to a
transcriptional activator domain. Any interaction
between them is detected by the formation of a
functional transcription factor.
Benefits: it is an in vivo technique; transient
and unstable interactions can be detected; it is
independent of endogenous protein expression; and
it has fine resolution, enabling interaction
mapping within proteins.
Drawbacks: only two proteins are tested at a time
(no cooperative binding); it takes place in the
nucleus, so many proteins are not in their native
compartment; and it predicts possible
interactions, but is unrelated to the
physiological setting.

Mass spectrometry of purified complexes.
Individual proteins are tagged and used as
'hooks' to biochemically purify whole protein
complexes. These are then separated and their
components identified by mass spectrometry. Two
protocols exist: tandem affinity purification
(TAP), and high-throughput mass-spectrometric
protein complex identification (HMS-PCI).
Benefits: several members of a complex can be
tagged, giving an internal check for consistency;
and it detects real complexes in physiological
settings.
Drawbacks: it might miss some complexes that are
not present under the given conditions; tagging
may disturb complex formation; and loosely
associated components may be washed off during
purification.

Correlated mRNA expression (synexpression). mRNA
levels are systematically measured under a
variety of different cellular conditions, and
genes are grouped if they show a similar
transcriptional response to these conditions.
These groups are enriched in genes encoding
physically interacting proteins.
Benefits: it is an in vivo technique, albeit an
indirect one; and it has much broader coverage of
cellular conditions than other methods.
Drawbacks: it is a powerful method for
discriminating cell states or disease outcomes,
but is a relatively inaccurate predictor of
direct physical interaction; and it is very
sensitive to parameter choices and clustering
methods during analysis.
Von Mering et al. Nature 417, 399 (2002)
63
High-throughput methods for detecting protein
interactions
Genetic interactions (synthetic lethality). Two
nonessential genes that cause lethality when
mutated at the same time form a synthetic lethal
interaction. Such genes are often functionally
associated and their encoded proteins may also
interact physically. This type of genetic
interaction is currently being studied in an
all-versus-all approach in yeast.
Benefits: it is an in vivo technique, albeit an
indirect one; and it is amenable to unbiased
genome-wide screens.

In silico predictions through genome analysis.
Whole genomes can be screened for three types of
interaction evidence: (1) in prokaryotic genomes,
interacting proteins are often encoded by
conserved operons; (2) interacting proteins have
a tendency to be either present or absent
together from fully sequenced genomes, that is,
to have a similar 'phylogenetic profile'; and (3)
seemingly unrelated proteins are sometimes found
fused into one polypeptide chain. This is an
indication for a physical interaction.
Benefits: fast and inexpensive in silico
techniques; and coverage expands as more genomes
are sequenced.
Drawbacks: it requires a framework for assigning
orthology between proteins, failing where
orthology relationships are not clear; and so far
it has focused mainly on prokaryotes.
Von Mering et al. Nature 417, 399 (2002)
64
Data set
Experiment
  Uetz et al.                   957 interactions
  Ito et al.                   4549 interactions
  HMS-PCI                     33014 interactions
In silico
  Conserved gene neighborhood  6387 interactions
  Gene fusions                  358 interactions
  Co-occurrence of genes        997 interactions
Von Mering et al. Nature 417, 399 (2002)
65
Counting interactions
  • Various high-throughput methods give differing
    results on the same complex.
  • >80,000 interactions available for yeast.
  • Only 2,400 are supported by more than 1 method.
  • Possible explanations:
  • Methods may not have reached saturation
  • Many of the methods produce a significant
    fraction of false positives
  • Some methods may have difficulties for certain
    types of interactions

Von Mering et al. Nature 417, 399 (2002)
66
Protein interactions between functional
categories
Each technique produces a unique distribution of
interactions with respect to functional
categories → methods have specific strengths and
weaknesses. E.g. TAP and HMS-PCI predict few
interactions for proteins involved in transport
and sensing because these categories are enriched
with membrane proteins. E.g. Y2H detects few
proteins involved in translation.
Von Mering et al. Nature 417, 399 (2002)
67
Complementarity between data sets
  • Glycine decarboxylase
  • Multienzyme complex needed when Gly is used as
    1-carbon source.
  • Its key components GCV1, GCV2, GCV3 are only
    induced when there is excess glycine and folate
    levels are low. This may explain why the complex
    is not detected in experiments.
  • However, 3 components can be detected by several
    independent in silico methods
  • Gene neighborhood of all 3 components in 7
    diverged species
  • genes show very similar phylogenetic
    distribution
  • microarrays: genes are closely co-regulated.

Opposite example: PPH3 protein. Complex found in 4
independent purifications, but no in silico
method predicts the interaction.
Von Mering et al. Nature 417, 399 (2002)
68
Quantitative comparison of interaction data sets
The various data sets are benchmarked against a
reference set of 10,907 trusted interactions,
which are derived from protein complexes
annotated manually in the MIPS and YPD databases.
Coverage and accuracy are lower limits owing to
incompleteness of the reference set. Each dot in
the graph represents an entire interaction data
set. For the combined evidence, consider only
interactions supported by an agreement of two (or
three) of any of the methods shown.
Von Mering et al. Nature 417, 399 (2002)
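The benchmark can be sketched as set arithmetic on unordered protein pairs. The definitions used here are assumptions for illustration: coverage is taken as the fraction of reference interactions recovered, and accuracy as the fraction of predicted interactions found in the reference. The paper's exact normalisation may differ, and all protein names and data sets below are made up.

```python
from collections import Counter


def normalize(pairs):
    """Treat interactions as unordered pairs and drop self-pairs."""
    return {frozenset(p) for p in pairs if len(set(p)) == 2}


def benchmark(predicted, reference):
    """Return (coverage, accuracy) of a data set against a reference set."""
    pred, ref = normalize(predicted), normalize(reference)
    confirmed = pred & ref
    coverage = len(confirmed) / len(ref) if ref else 0.0
    accuracy = len(confirmed) / len(pred) if pred else 0.0
    return coverage, accuracy


def combined_evidence(datasets, min_support=2):
    """Keep only interactions reported by at least `min_support` data sets."""
    support = Counter()
    for data in datasets:
        support.update(normalize(data))
    return {pair for pair, n in support.items() if n >= min_support}


if __name__ == "__main__":
    reference = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
    method_1 = [("A", "B"), ("X", "Y")]
    method_2 = [("B", "A"), ("B", "C"), ("Q", "R")]
    print(benchmark(method_1, reference))
    print(benchmark(combined_evidence([method_1, method_2]), reference))
```

In this toy example, requiring agreement of two methods raises accuracy at the cost of coverage, which is the trade-off the combined-evidence points in the graph illustrate.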
69
Biases in interaction coverage
Experiment
  Uetz et al.                   957 interactions
  Ito et al.                   4549 interactions
  HMS-PCI                     33014 interactions
In silico
  Conserved gene neighborhood  6387 interactions
  Gene fusions                  358 interactions
  Co-occurrence of genes        997 interactions
None of the methods covers more than 60% of the
proteins in the yeast genome. Are there common
biases as to which proteins are covered?
Von Mering et al. Nature 417, 399 (2002)
70
Bias 1: towards proteins of high abundance
mRNA abundance is a rough measure of protein
abundance. Here, divide yeast genome into 10
mRNA abundance classes (bins) of equal size. For
each data set and abundance class, the number of
interactions having at least one protein in that
class is recorded. Each interaction (AB) is
counted twice: once under the abundance class of
partner A, and once under the abundance class of
partner B. → Most data sets are heavily biased
towards proteins of high abundance, except for
genetic techniques (Y2H and synthetic lethality).
Von Mering et al. Nature 417, 399 (2002)
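The counting scheme above can be sketched as follows. This is a minimal illustration, assuming mRNA abundance is given as a dictionary from gene name to a number and that genes are ranked and cut into equally sized bins. The function names and example values are hypothetical.

```python
def abundance_classes(mrna_abundance, n_bins=10):
    """Rank genes by mRNA abundance and assign them to n_bins equal-size bins.

    Bin 0 holds the least abundant genes, bin n_bins - 1 the most abundant.
    """
    ranked = sorted(mrna_abundance, key=mrna_abundance.get)
    n = len(ranked)
    return {gene: rank * n_bins // n for rank, gene in enumerate(ranked)}


def interactions_per_class(interactions, classes, n_bins=10):
    """Count each interaction twice: once under the class of each partner."""
    counts = [0] * n_bins
    for a, b in interactions:
        for partner in (a, b):
            if partner in classes:  # skip genes without abundance data
                counts[classes[partner]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical abundances and interactions, with 2 bins for brevity.
    abundance = {"g1": 1.0, "g2": 5.0, "g3": 20.0, "g4": 100.0}
    classes = abundance_classes(abundance, n_bins=2)
    print(classes)
    print(interactions_per_class([("g3", "g4"), ("g1", "g4")], classes, n_bins=2))
```

A data set biased towards abundant proteins shows up as counts that rise towards the last bins.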
71
Bias 2: towards cellular localization
Protein localization and interaction coverage.
Protein localizations are derived from the MIPS
and TRIPLES databases. a, The distribution of
protein localization among the proteins covered
by a data set. E.g. in silico predictions
overestimate mitochondrial interactions.
Von Mering et al. Nature 417, 399 (2002)
72
Bias 2: towards cellular localization
Independent quality measure: do proteins that
interact belong to the same compartment? The Y2H
method gives relatively poor results here.
Von Mering et al. Nature 417, 399 (2002)
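One simple way to turn this question into a number is the fraction of interacting pairs that share at least one annotated compartment. The sketch below assumes a localization table mapping each protein to a set of compartments and ignores pairs with a missing annotation. Both choices are assumptions for illustration, and the example data are made up.

```python
def colocalized_fraction(interactions, localization):
    """Fraction of annotated interacting pairs sharing a compartment.

    Returns None if no pair has localization data for both partners.
    """
    annotated = [(a, b) for a, b in interactions
                 if a in localization and b in localization]
    if not annotated:
        return None
    shared = sum(1 for a, b in annotated if localization[a] & localization[b])
    return shared / len(annotated)


if __name__ == "__main__":
    # Hypothetical localization annotations and interactions.
    localization = {
        "A": {"nucleus"},
        "B": {"nucleus", "cytoplasm"},
        "C": {"mitochondria"},
    }
    interactions = [("A", "B"), ("A", "C"), ("B", "Z")]  # Z has no annotation
    print(colocalized_fraction(interactions, localization))
```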
73
Bias 3 in interaction coverage
Separate yeast genome into 4 classes according to
the conservation of the genes in other
species. The presence of a gene in any of these
species was concluded from bi-directional best
hits in Smith-Waterman searches, using 0.01 as
cut-off. Bias related to the degree of
evolutionary novelty of proteins: proteins
restricted to yeast are less well covered than
ancient, evolutionarily conserved proteins.
Von Mering et al. Nature 417, 399 (2002)
74
Outlook
  • How many protein-protein interactions can be
    expected in yeast?
  • Overlap of high-throughput data is 20 times
    larger than expected by chance.
  • Good signal-to-noise ratio.
  • Also, for interactions discovered 2 times,
    usually both partners have the same functional
    category and cellular localization.
  • → Overlap mainly consists of true positives.
  • Less than 1/3 of new interactions in overlap set
    were previously known.
  • Given 10,000 currently known interactions, predict
    >30,000 protein interactions in yeast (lower
    boundary).

Von Mering et al. Nature 417, 399 (2002)
75
Problems
Unfortunately, interaction data sets are often
incomplete and contradictory (von Mering et al.
2002).
In the context of genome-wide analyses, these
inaccuracies are greatly magnified because the
protein pairs that do not interact (negatives) by
far outnumber those that do interact (positives).
E.g. in yeast, the 6000 proteins allow for
N(N-1)/2 ≈ 18 million potential interactions.
But the estimated number of actual interactions
is < 100,000.
Therefore, even reliable techniques can generate
many false positives when applied genome-wide.
Think of a diagnostic with a 1% false-positive
rate for a rare disease occurring in 0.1% of the
population. This would roughly produce 1 true
positive for every 10 false ones.
Jansen et al. Science 302, 449 (2003)
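The arithmetic behind both statements can be checked in a few lines. The sketch assumes a perfectly sensitive test (every true case is detected), which the slide does not state. The function names are illustrative only.

```python
def potential_pairs(n_proteins: int) -> int:
    """Number of unordered protein pairs, N(N-1)/2."""
    return n_proteins * (n_proteins - 1) // 2


def false_per_true(prevalence: float, false_positive_rate: float,
                   sensitivity: float = 1.0) -> float:
    """Expected number of false positives per true positive.

    ASSUMPTION: sensitivity defaults to 1.0 (all true cases are detected).
    """
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return false_positives / true_positives


if __name__ == "__main__":
    pairs = potential_pairs(6000)
    print(pairs)                        # about 18 million pairs
    print(false_per_true(0.001, 0.01))  # about 10 false per true positive
    # Upper-bound fraction of yeast pairs that interact, from the slide's numbers.
    print(100_000 / pairs)
```

With at most 100,000 true interactions among roughly 18 million pairs, interacting pairs make up well under 1% of all pairs, so the rare-disease analogy applies directly to genome-wide interaction screens.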