Sequential Sampling Designs for Small-Scale Protein Interaction Experiments - PowerPoint PPT Presentation

About This Presentation
Title:

Sequential Sampling Designs for Small-Scale Protein Interaction Experiments

Description:

Sequential Sampling Designs for Small-Scale Protein Interaction Experiments Denise Scholtens, Ph.D. Associate Professor, Northwestern University, Chicago IL – PowerPoint PPT presentation

Date added: 4 December 2019
Slides: 39
Provided by: Bioinform97
Transcript and Presenter's Notes

Title: Sequential Sampling Designs for Small-Scale Protein Interaction Experiments


1
Sequential Sampling Designs for Small-Scale
Protein Interaction Experiments
  • Denise Scholtens, Ph.D.
  • Associate Professor, Northwestern University,
    Chicago IL
  • Department of Preventive Medicine, Division of
    Biostatistics
  • Joint work with Bruce Spencer, Ph.D.
  • Professor, Northwestern University, Evanston IL
  • Department of Statistics and Institute for Policy
    Research

2
Large Scale Protein Interaction Graphs
  • Often steady-state organisms
  • E.g. Saccharomyces cerevisiae, various
    interaction types
  • Gavin et al. (2002, 2006) Nature, Ho et al.
    (2002) Nature, Krogan et al. (2006) Nature, Ito
    et al. (1998) PNAS, Uetz et al. (2000) Nature,
    Tong et al. (2006) Science, Pan et al. (2006)
    Cell
  • Topology
  • Modular organization into complexes/groups
  • Bader et al. (2003) BMC Bioinformatics,
    Scholtens et al. (2005) Bioinformatics, Zhang et
    al. (2008) Bioinformatics, Qi et al. (2008)
    Bioinformatics
  • Global characterization as small-world,
    scale-free, hierarchical, etc.
  • Watts and Strogatz (1998) Nature, Barabási and
    Albert (1999) Science, Sales-Pardo et al. (2007)
    PNAS
  • Measurement Error
  • False positive/negative probabilities
  • Chiang et al. (2007) Genome Biology, Chiang and
    Scholtens (2009) Nature Protocols
  • Mostly large graphs
  • 100s-1000s of nodes
  • 1000s-10,000s of edges

Fig. 4, Gavin et al. (2002) Nature
Top panel: nodes are protein complex estimates; edges are common members.
Bottom panel: nodes are proteins; edges are complex co-membership (often called indirect interaction).
3
Sampled data
Three bait-prey pull-downs from Gavin et al. (2002):
  • Apl5: Apl6, Apm3, Aps3, Ckb1
  • Apl6: Apl5, Apm3, Eno2
  • Apm3: Apl6, Apm3

[Figure: one AP-MS pull-down on nodes Apl5, Apl6, Apm3, Aps3, Ckb1, Eno2. Legend: bait; prey; untested edge (?); tested absent edge]

AP-MS data capture bait-prey relationships: a bait finds interacting prey with common membership in at least one complex.

Maximal cliques map to protein complexes: when all proteins are used as baits, all nodes have edges to all other nodes in the clique, and the clique is not contained in any other clique.

NOTE: Failure to test all edges means we typically cannot identify maximal cliques.
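The distinction between tested and untested edges can be made concrete with the three pull-downs listed on this slide. The following is a minimal sketch, not the authors' code: the function name `classify_edges` is hypothetical, and it assumes an edge counts as present if either bait reports the other node.

```python
from itertools import combinations

# Pull-downs as listed on the slide: bait -> prey found by that bait.
PULLDOWNS = {
    "Apl5": ["Apl6", "Apm3", "Aps3", "Ckb1"],
    "Apl6": ["Apl5", "Apm3", "Eno2"],
    "Apm3": ["Apl6", "Apm3"],
}

def classify_edges(pulldowns):
    """Split node pairs into tested-present, tested-absent and untested."""
    baits = set(pulldowns)
    nodes = baits.union(*pulldowns.values())
    # Assumption: an edge is present if either bait reports the other node.
    present = {
        frozenset((bait, prey))
        for bait, prey_list in pulldowns.items()
        for prey in prey_list
        if prey != bait  # ignore a bait listed among its own prey
    }
    absent, untested = set(), set()
    for pair in map(frozenset, combinations(sorted(nodes), 2)):
        if pair in present:
            continue
        # A pair is tested only if at least one endpoint was used as a bait.
        (absent if pair & baits else untested).add(pair)
    return present, absent, untested

present, absent, untested = classify_edges(PULLDOWNS)
```

Under these assumptions, the only untested pairs are those among the prey-only nodes Aps3, Ckb1 and Eno2, which is why maximal cliques cannot be confirmed from these three baits alone.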
4
Inference using a portion of possible baits
Two protein complexes with physical topologies shown by black edges
[Figure: nodes A, B, C, D, E, F; A appears in both complexes]

If the AP-MS technology works perfectly (i.e. no false positives or false negatives):
  • 1 bait (A): 5 tested edges (5 present, 0 absent), 10 untested edges
  • 2 baits (A, B): 9 tested edges (7 present, 2 absent), 6 untested edges
  • 3 baits (A, B, C): 12 tested edges (8 present, 4 absent), 3 untested edges
  • 6 baits (A, B, C, D, E, F): 15 tested edges (9 present, 6 absent)
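The edge counts on this slide can be reproduced by enumerating node pairs. The sketch below assumes the two complexes are {A, B, D, F} and {A, C, E}, as given by the affiliation matrix later in the talk, and assumes error-free AP-MS; `edge_counts` is a hypothetical helper name.

```python
from itertools import combinations

# Assumed complex membership, taken from the affiliation matrix in the deck.
COMPLEXES = [{"A", "B", "D", "F"}, {"A", "C", "E"}]
NODES = sorted(set().union(*COMPLEXES))

def edge_counts(baits):
    """Count tested/present/absent/untested pairs for a given bait set."""
    tested = present = untested = 0
    for u, v in combinations(NODES, 2):
        if u in baits or v in baits:
            tested += 1
            # Error-free AP-MS: an edge is seen iff the pair shares a complex.
            present += any(u in c and v in c for c in COMPLEXES)
        else:
            untested += 1
    return {
        "tested": tested,
        "present": present,
        "absent": tested - present,
        "untested": untested,
    }

for baits in ({"A"}, {"A", "B"}, {"A", "B", "C"}, set(NODES)):
    print(sorted(baits), edge_counts(baits))
```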
5
Smaller-scale studies
  • What if we are interested only in a portion of
    the graph?
  • Cataloguing complexes/ describing the local
    neighborhood for a pre-specified set of starting
    baits
  • Comparing local neighborhoods for different
    sample types
  • disease vs. normal
  • treated vs. untreated

[Figure legend: starting bait of interest; interesting neighbor; less interesting neighbor; uninteresting neighbor]
6
Link tracing designs (or snowball sampling)
  • Start with a set of nodes as starting baits (S0)
  • Identify interacting partners
  • Use interacting partners as new set of baits,
    excluding those already used as baits
  • Identify their interacting partners
  • Etc.

[Figure: successive sampling waves S0, S1, S2, S3]
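The link tracing steps listed on this slide can be sketched as a short loop. This is an illustrative sketch only: the function name `snowball` and the toy adjacency dictionary are hypothetical, and the graph is treated as fully known so that "identifying partners" is a dictionary lookup.

```python
def snowball(adjacency, starting_baits, waves):
    """Return the bait sets S0, S1, ... for a full snowball sample."""
    used = set()
    current = set(starting_baits)
    samples = []
    for _ in range(waves + 1):
        if not current:
            break
        samples.append(current)
        used |= current
        # Interacting partners of the current baits ...
        partners = set().union(*(adjacency[b] for b in current))
        # ... become the next baits, excluding those already used.
        current = partners - used
    return samples

# Hypothetical toy graph: a-b, b-c, c-d, a-e.
TOY = {
    "a": {"b", "e"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"a"},
}
waves = snowball(TOY, {"a"}, 3)
```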
7
Link tracing notation Adapted from Handcock and
Gile (2010) Annals of Applied Statistics
8
Link tracing notation Adapted from Handcock and
Gile (2010) Annals of Applied Statistics
9
Link tracing notation
10
Link tracing notation
11
A simple scheme
  • Let ψm remain constant over all sampling waves,
    e.g. choose a fixed proportion p of all eligible
    baits at each wave.
  • This leads to a simplification in the probability
    of observing a specific sample. In particular,

$$\Pr(S_m = s_m \mid E_m, \psi_m) = \prod_{i=1}^{n} (p E_{mi})^{s_{mi}} (1 - p E_{mi})^{1 - s_{mi}}$$
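Read as a product of independent Bernoulli terms over the n nodes, with success probability p for eligible nodes and 0 for ineligible ones, the sample probability on this slide can be evaluated directly. A minimal sketch under that reading; `sample_probability` and its argument names are hypothetical.

```python
def sample_probability(s, e, p):
    """Probability of sample indicator vector s given eligibility vector e.

    s[i] is 1 if node i is chosen as a bait in this wave, else 0.
    e[i] is 1 if node i is eligible in this wave, else 0.
    """
    prob = 1.0
    for s_i, e_i in zip(s, e):
        q = p * e_i  # selection probability for node i
        prob *= q if s_i else 1.0 - q
    return prob

# Two eligible nodes and one ineligible node, p = 1/4.
example = sample_probability([1, 0, 0], [1, 1, 0], 0.25)
```

Selecting an ineligible node gives probability zero, since its selection probability is p times 0.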
12
Sampling 1/4 of all eligible baits
S0 = {n1, n2, n3}
E1 = {n4, n6, n12, n13, n14, n15, n16, n17}
S1 = {n4, n12}
E2 = {n6, n13, n14, n15, n16, n17, n34, n35, n36, n37, n38, n59, n97, n98, n99, n100, n194}
S2 = {n15, n59, n97, n98, n99}
Etc.
Note that we do not cover all portions of the
graph that we would with a full snowball sample.
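One wave of this scheme can be sketched as drawing a fixed fraction of the eligible set at random. The helper `sample_wave` is hypothetical, and rounding the sample size up is an assumption; it is consistent with the counts on this slide (2 of 8 eligible baits, then 5 of 17), but the slide does not state the rounding rule.

```python
import math
import random

def sample_wave(eligible, p, rng):
    """Draw ceil(p * |eligible|) baits uniformly at random from the eligible set."""
    pool = sorted(eligible)
    k = math.ceil(p * len(pool))  # assumption: round up
    return set(rng.sample(pool, k))

E1 = ["n4", "n6", "n12", "n13", "n14", "n15", "n16", "n17"]
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
S1 = sample_wave(E1, 0.25, rng)
```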
13
Negative binomial
  • In this setting, a path of length l extending
    from one of the starting baits follows a negative
    binomial distribution for being tested (and
    therefore observed) in m rounds of sampling
    (0 < l ≤ m).
  • $\Pr(\text{observing a path of length } l \text{ in } m \text{ rounds}) = \binom{m-1}{l-1} p^{l} (1-p)^{m-l}, \quad m = l, l+1, \ldots$

Test all 3 nodes/edges in 3 rounds: p, p, p
Test 3 nodes/edges in 4 rounds: 1-p, p, p, p
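The negative binomial probability above, read as C(m-1, l-1) p^l (1-p)^(m-l), is easy to evaluate numerically. A minimal sketch; the function name `prob_path_tested` is hypothetical.

```python
from math import comb

def prob_path_tested(l, m, p):
    """Probability that a path of length l is fully tested in exactly m rounds."""
    if not 0 < l <= m:
        return 0.0
    return comb(m - 1, l - 1) * p**l * (1 - p) ** (m - l)

three_in_three = prob_path_tested(3, 3, 0.5)  # p * p * p
three_in_four = prob_path_tested(3, 4, 0.5)   # one miss among the first three rounds
```

With p = 1/2 the two cases from the slide evaluate to 0.125 and 0.1875, and summing over m = l, l+1, ... gives 1.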
14
Cumulative probabilities
  • The cumulative probability for observing paths
    with nodes that are sampled early on is higher
    than those that enter later.
  • When nodes are tightly grouped in cliques, this
    can lead to over-sampling in regions of the graph
    with high-confidence clique estimates.
  • I.e., we may be satisfied with a clique estimate
    that has a certain proportion of tested edges,
    but if the involved nodes are identified early in
    the process, chances are they will eventually
    enter the sample. So how can we move on and
    sample other areas?
  • There is also great dependency among joint
    probabilities of testing any pair (or larger
    collection) of paths, especially among nodes with
    common paths extending from the starting baits.

15
Tested fraction of edges
  • In addition, we are interested in complexes with
    a certain proportion of tested edges out of those
    that are possible, not necessarily a proportion
    of tested baits (although they are related)

  • 1 bait (A): 5 tested edges, so 5/15 = 1/3 of possible edges are tested
  • 2 baits (A, B): 9 tested edges, 9/15 = 3/5 tested
  • 3 baits (A, B, C): 12 tested edges, 12/15 = 4/5 tested
  • 6 baits (A, B, C, D, E, F): 15 tested edges, 9 present, 6 absent
16
Edge imputation
  • Assume a simple edge imputation scheme in which
    untested edges are assumed to exist if the
    involved prey share at least one common bait.
  • This is consistent with high clustering
    coefficients observed for these types of graphs
    as well as existing clique estimation algorithms
    on partially observed graphs.
  • A complex (or clique) estimate may be considered
    high quality if more than half of the involved
    edges are tested and observed.

High quality: 9/15 = 0.6 edges observed
Low quality: 13/28 ≈ 0.46 edges observed
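The imputation rule on this slide (an untested edge between two prey is assumed to exist if they share at least one common bait) can be sketched as follows. The function name `impute_edges` is hypothetical, and the input reuses the three Gavin et al. (2002) pull-downs quoted earlier in the deck.

```python
from itertools import combinations

PULLDOWNS = {
    "Apl5": ["Apl6", "Apm3", "Aps3", "Ckb1"],
    "Apl6": ["Apl5", "Apm3", "Eno2"],
    "Apm3": ["Apl6", "Apm3"],
}

def impute_edges(pulldowns):
    """Return untested prey-prey pairs that share at least one common bait."""
    baits = set(pulldowns)
    imputed = set()
    for prey in pulldowns.values():
        # Pairs involving a bait are already tested, so keep prey-only nodes.
        prey_only = sorted(set(prey) - baits)
        imputed.update(frozenset(pair) for pair in combinations(prey_only, 2))
    return imputed

imputed = impute_edges(PULLDOWNS)
```

Here only Aps3 and Ckb1 share a bait (Apl5), so theirs is the single imputed edge; Eno2 stays unconnected to them.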
17
Tested fraction of edges
  • In a collection of nodes involving b baits and q
    prey-only nodes with no measurement error for
    edge observations, we have
  • b(b-1)/2 tested edges among baits
  • bq tested edges among bait-prey pairs
  • (b+q)(b+q-1)/2 possible edges among all nodes
  • So then the proportion of observed edges is
  • $\dfrac{b(b-1) + 2bq}{(b+q)(b+q-1)}$

18
A modification capturing dependency among nodes
Two protein complexes with physical topologies
[Figure: nodes A, B, C, D, E, F; A appears in both complexes]

Affiliation matrix A: nodes to cliques
    c1  c2
A    1   1
B    1   0
C    0   1
D    1   0
E    0   1
F    1   0

Incidence matrix Y among nodes (the corresponding AP-MS graph)
    A  B  C  D  E  F
A   1  1  1  1  1  1
B   1  1  0  1  0  1
C   1  0  1  0  1  0
D   1  1  0  1  0  1
E   1  0  1  0  1  0
F   1  1  0  1  0  1

$Y = AA^{T}$ under Boolean algebra: 1+1 = 1×1 = 1+0 = 1; 0+0 = 0×0 = 0×1 = 0
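The Boolean product relating the affiliation matrix to the AP-MS incidence matrix can be checked in a few lines of plain Python. This is a sketch; `bool_matmul` and `transpose` are hypothetical helper names, and the matrix entries are copied from this slide.

```python
# Affiliation matrix: rows are nodes A-F, columns are cliques c1, c2.
A = [
    [1, 1],  # A
    [1, 0],  # B
    [0, 1],  # C
    [1, 0],  # D
    [0, 1],  # E
    [1, 0],  # F
]

def transpose(M):
    return [list(col) for col in zip(*M)]

def bool_matmul(M, N):
    """Matrix product where + is logical OR and x is logical AND."""
    return [
        [int(any(a and b for a, b in zip(row, col))) for col in zip(*N)]
        for row in M
    ]

# Entry (i, j) is 1 iff nodes i and j share at least one clique.
Y = bool_matmul(A, transpose(A))
```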
19
Strata: Nodes with identical adjacency
AP-MS graph Y
    A  B  C  D  E  F
A   1  1  1  1  1  1
B   1  1  0  1  0  1
C   1  0  1  0  1  0
D   1  1  0  1  0  1
E   1  0  1  0  1  0
F   1  1  0  1  0  1
[Figure: AP-MS graph with nodes colored by stratum: {A}, {B, D, F}, {C, E}]
20
  • All nodes with matching colors on the previous
    slide are connected to each other, and have
    matching sets of adjacent nodes
  • In some sense, they contain redundant
    information
  • And in a measurement error setting, extremely
    highly correlated information
  • If we know the strata, and we know the set of
    adjacent nodes for one member node, then we know
    the set of adjacent nodes for all other members
    of the same stratum
  • For sampling purposes, it seems reasonable to
    represent these subpopulations by design
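Since strata are defined as sets of nodes with identical adjacency, they can be recovered by grouping identical rows of the AP-MS incidence matrix. A minimal sketch with a hypothetical function name, using the six-node matrix from the previous slide.

```python
NODES = ["A", "B", "C", "D", "E", "F"]
Y = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 1],
]

def strata_from_adjacency(nodes, incidence):
    """Group nodes whose rows of the incidence matrix are identical."""
    groups = {}
    for node, row in zip(nodes, incidence):
        groups.setdefault(tuple(row), []).append(node)
    return list(groups.values())

strata = strata_from_adjacency(NODES, Y)
```

Because the diagonal of Y is 1, nodes with identical rows are also adjacent to each other, matching the first bullet on this slide.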

21
AP-MS graph

Affiliation matrix A: nodes to cliques
    c1  c2
A    1   1
B    1   0
C    0   1
D    1   0
E    0   1
F    1   0

Incidence matrix Y among nodes
    A  B  C  D  E  F
A   1  1  1  1  1  1
B   1  1  0  1  0  1
C   1  0  1  0  1  0
D   1  1  0  1  0  1
E   1  0  1  0  1  0
F   1  1  0  1  0  1

$Y = AA^{T}$ under Boolean algebra: 1+1 = 1×1 = 1+0 = 1; 0+0 = 0×0 = 0×1 = 0

Affiliation matrix X: nodes to strata
    g1  g2  g3
A    1   0   0
B    0   1   0
C    0   0   1
D    0   1   0
E    0   0   1
F    0   1   0

Affiliation matrix Q: strata to cliques
     c1  c2
g1    1   1
g2    1   0
g3    0   1

[Figure: AP-MS graph collapsed to strata g1 = {A}, g2 = {B, D, F}, g3 = {C, E}]
22
  • Note the following properties
  • $QQ^{T}$ is the incidence matrix among strata
  • $XQ = A$
  • $XQ(XQ)^{T} = AA^{T} = Y$
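The properties on this slide, read as XQ = A and XQ(XQ)^T = AA^T = Y under Boolean algebra, can be verified on the six-node example. A sketch in plain Python; `bool_matmul` and `transpose` are hypothetical helper names, and the matrices are copied from the preceding slide.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def bool_matmul(M, N):
    """Matrix product where + is logical OR and x is logical AND."""
    return [
        [int(any(a and b for a, b in zip(row, col))) for col in zip(*N)]
        for row in M
    ]

# Nodes A-F to strata g1, g2, g3.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
# Strata g1, g2, g3 to cliques c1, c2.
Q = [[1, 1], [1, 0], [0, 1]]
# Nodes A-F to cliques c1, c2.
A = [[1, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]

XQ = bool_matmul(X, Q)
Y = bool_matmul(XQ, transpose(XQ))
strata_incidence = bool_matmul(Q, transpose(Q))
```

In `strata_incidence`, stratum g1 (node A) is linked to both other strata, while g2 and g3 are not linked to each other.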

23
Stratified sampling
  • The idea: use estimated strata to inform sampling
  • Maintain a constant fraction of tested edges
    within each estimated stratum
  • This will help identify strata and summarize
    their connectivity to other strata
  • It will also help focus our resources in areas
    that require more observations as opposed to
    those that have been adequately sampled according
    to some desired threshold for the fraction of
    tested edges

24
Stratified sampling
Testing at least half of the edges within a stratum with 10 member nodes: at least 3 baits are required
  • Have 1 bait: choose 2 more baits
  • Have 2 baits: choose 1 more bait
  • Have 4 baits: don't sample from this stratum (or do so with small probability)
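The "at least 3 baits" figure follows from the tested-fraction formula given earlier in the talk, [b(b-1) + 2bq] / [(b+q)(b+q-1)], applied within a stratum of 10 nodes. A small sketch that finds the minimum number of baits and the number still to be chosen; both function names are hypothetical.

```python
from fractions import Fraction

def min_baits(n, threshold):
    """Smallest number of baits that tests at least `threshold` of the
    n(n-1)/2 edges within a stratum of n nodes (n >= 2)."""
    for b in range(n + 1):
        q = n - b
        if Fraction(b * (b - 1) + 2 * b * q, n * (n - 1)) >= threshold:
            return b
    return n

def additional_baits(n, have, threshold):
    """Number of extra baits to sample from a stratum that already has `have`."""
    return max(0, min_baits(n, threshold) - have)

half = Fraction(1, 2)
needed = min_baits(10, half)
```

With 2 baits only 34 of 90 ordered pairs (about 38%) are covered, while 3 baits cover 48 of 90 (about 53%), which is why 3 is the minimum.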
25
Stratified sampling
  • While the strata and the fraction of tested edges
    within them determine the number of additional
    baits to include, the samples do also include
    observations of edges connecting pairs of nodes
    in different strata

[Figure legend: tested edge within strata; tested edge between strata]
26
Stratified sampling
  • Algorithm
  • Specify starting baits S0 and form E1
  • Impute edges among prey-only nodes with at least
    one common bait
  • Estimate strata according to matching adjacency
    in Y1 to form X1
  • Calculate fraction of tested edges for each
    stratum determined by X1
  • Determine number of additional baits required for
    each stratum and sample accordingly to form S1
  • Repeat
  • At each step k, we can also estimate Qk, Yk
    and/or Ak

27
A comparison: Threshold sampling
  • Similar to the simple random sampling scheme
    introduced earlier
  • Rather than specifying a set proportion of baits
    to test, sample the appropriate number to test a
    certain fraction of all possible edges in the
    graph given the identified nodes

28
Simulation: In silico Interactome
  • We used the ScISI Bioconductor package to create
    an in silico interactome containing protein
    complex data reported in the Cellular Component
    Gene Ontology and at MIPS for Saccharomyces
    cerevisiae.
  • The largest connected component of the resultant
    graph contains 1404 nodes and 86609 edges.
  • 197 protein complexes are represented with a
    range of sizes from 2 to 308 (median 18).

29
Simulation Study
  • Compared stratified (str) and threshold (thresh)
    sampling schemes
  • Specified tested fractions of 1/10 and 1/20 of
    all possible edges
  • Called a complex high quality if at least 1/2
    of the edges were tested
  • For each iteration, randomly chose 3 nodes with
    close proximity as starting baits
  • 250 rounds for each scheme

30
Mean number correctly identified high-quality
complexes
31
Standard errors on number of correctly identified
complexes
32
Standard error / number identified
33
Cumulative number of baits
mean number of complexes
34
Number of baits per complex
35
Number of complexes vs. number of baits
36
Discussion
  • Large-scale protein interaction experiments are
    very costly and may not be of interest in smaller
    lab settings or for investigations of particular
    cellular functions
  • As long as we are comfortable with some
    estimation of untested edges, sampling identified
    prey to create the next bait set may yield
    considerable savings

37
Discussion
  • Using estimated sampling strata seems to provide
    a greater balance of resource allocation across
    the graph
  • Work still in progress suggests that this is due
    to a reduction in cumulative sampling variability
    across the graph
  • As long as the per-bait cost is less than the
    per-sampling-round cost, stratified sampling
    appears to be a better approach

38
Extensions
  • Measurement error can be easily included in
    specification of Em, and adaptations of clique
    identification (e.g. the penalized likelihood
    method in Bioconductor's apComplex) can be used
    instead of straightforward imputation
  • This would also be a natural starting point for
    adaptively designing experiments to compare
    different sample types