Respondent-driven Sampling for Characterizing Unstructured Overlays

1
Respondent-driven Sampling for Characterizing
Unstructured Overlays
Graciously presented by Shubho Sen, AT&T Labs - Research
2
Motivation
  • P2P systems are very popular in practice.
  • Millions of simultaneous users.
  • A significant fraction of Internet traffic
  • Measurement studies aid understanding existing
    systems and user behavior.
  • Capturing an accurate global snapshot is often
    infeasible.
  • P2P systems are distributed, large, and rapidly
    changing.
  • P2P crawlers are likely to capture incomplete or
    distorted snapshots
  • Sampling is a natural approach, and has been used
    implicitly in most earlier P2P measurement
    studies.
  • How can we collect representative samples?

3
The Graph Sampling Problem
  • We focus on sampling peer properties, such as
    number of neighbors (degree), access link
    bandwidth, session time, files
  • Sampling peer properties has two steps:
  • Discovering and selecting peers (or samples)
  • Measuring the desired properties of selected
    peers
  • Selecting peers uniformly at random is hard:
    there are two sources of bias [StutzbachIMC06]
  • Topological: high-degree peers are more likely to
    be selected
  • Temporal: short-lived peers are more likely to
    be selected
  • Random walks are a promising approach to sampling
  • The resulting bias is precisely known
  • Samples can be collected in parallel by multiple
    walkers

4
Sampling Using Random Walk
  • Random walks can be described with a transition
    matrix P(x,y)
  • P(x,y): probability of moving from x to y
  • P^r(x,y): probability of moving from x to y after
    r moves
  • Random walks converge to a stationary
    distribution
  • Problem: we need a uniform distribution
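
Why a plain random walk is not uniform can be seen in a small simulation. On a connected, non-bipartite undirected graph, a simple random walk visits node x with long-run frequency deg(x) / (2|E|), so high-degree nodes are over-represented. The sketch below is only an illustration: the four-node graph, the seed, and the function name `simple_random_walk` are assumptions made for this example and do not come from the slides.

```python
import random
from collections import Counter

def simple_random_walk(adj, start, steps, rng):
    """Walk `steps` moves; at each node pick a neighbor uniformly at random."""
    visits = Counter()
    node = start
    for _ in range(steps):
        node = rng.choice(adj[node])
        visits[node] += 1
    return visits

# Hypothetical toy graph: node 0 is a hub linked to 1, 2, 3; nodes 1 and 2 are also linked.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(1)
steps = 200_000
visits = simple_random_walk(adj, 0, steps, rng)

total_degree = sum(len(nbrs) for nbrs in adj.values())  # equals 2|E|
for v in sorted(adj):
    observed = visits[v] / steps
    expected = len(adj[v]) / total_degree
    print(v, round(observed, 3), round(expected, 3))
```

The observed visit frequencies track the degree-proportional values rather than 1/4 each, which is the bias that both MRW and RDS have to correct.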

5
Metropolized Random Walk (MRW)
  • The Metropolis-Hastings method modifies the
    transition matrix to yield the desired uniform
    distribution [StutzbachIMC06]
  • MRW method:
  • Select a neighbor y of x uniformly at random
  • Transition to y with probability
    min(deg(x)/deg(y), 1)
  • Otherwise, self-loop to x.
  • Results in a uniform stationary distribution: π(x) = 1/|V|
  • MRW compensates for bias as samples are collected
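
The MRW transition rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the adjacency-list graph, the seed, and the helper name `mrw_step` are hypothetical and chosen only for this example.

```python
import random
from collections import Counter

def mrw_step(adj, x, rng):
    """One Metropolized random walk step from node x."""
    y = rng.choice(adj[x])                      # propose a neighbor uniformly at random
    accept = min(len(adj[x]) / len(adj[y]), 1)  # min(deg(x)/deg(y), 1)
    return y if rng.random() < accept else x    # otherwise self-loop at x

# Hypothetical toy graph with degrees 3, 2, 2, 1.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(7)
steps = 200_000
visits = Counter()
node = 0
for _ in range(steps):
    node = mrw_step(adj, node, rng)
    visits[node] += 1

mrw_freq = {v: visits[v] / steps for v in adj}
print(mrw_freq)
```

Despite the unequal degrees, each node's visit frequency settles near 1/|V| = 0.25. The price is the self-loops: the walker often stays put when the proposed neighbor has a higher degree.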

6
This paper
  • Presents a new graph sampling technique,
    Respondent-Driven Sampling (RDS)
  • Compares the performance of the RDS and MRW sampling
    techniques using simulations and experiments

7
Respondent-driven Sampling
  • A development of Snowball Sampling [Salganik04]
  • Commonly used in social sciences to sample
    hidden populations, e.g. HIV-positive individuals
  • Social relationships (references) are used by
    sampler to diffuse into hidden populations
  • Each person introduces n other persons
  • Similar to random walk (n = 1)
  • We adopt the RDS technique from social sciences
    for sampling P2P networks

8
RDS Formulation
  • Goal: estimate the distribution of node property
    X
  • Perform a regular random walk, collecting the values of
    property X and node degree (deg(v)) at each
    visited node
  • Deal with the bias during post-processing as
    follows:
  • Divide the possible values for X into several ranges
    R1, . . . ,Rm
  • Partition the nodes by the range their X value falls in:
    V1, . . . ,Vm
  • Using the Hansen-Hurwitz estimator to compensate for
    the bias, the proportion of all nodes in group i
    is estimated as
    $\hat{p}_i = \frac{\sum_{v \in T_i} 1/\deg(v)}{\sum_{u \in T} 1/\deg(u)}$
  • Ti: visited samples in group i
  • T: all visited samples
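
The post-processing step can be sketched as follows, assuming the usual Hansen-Hurwitz form for a degree-biased walk, in which every visited sample is weighted by 1/deg(v). The function name `rds_estimate`, the `bin_of` callback, and the sample data are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

def rds_estimate(samples, bin_of):
    """Estimate the proportion of nodes in each range of property X.

    samples: list of (x_value, degree) pairs collected along a random walk.
    bin_of:  maps an x_value to the index (or label) of its range.
    """
    weight = defaultdict(float)
    for x, deg in samples:
        weight[bin_of(x)] += 1.0 / deg   # down-weight over-sampled high-degree nodes
    total = sum(weight.values())
    return {i: w / total for i, w in weight.items()}

# Hypothetical walk: degree-4 peers with short uptime were visited four times
# as often as a degree-1 peer with long uptime.
samples = [(10, 4)] * 4 + [(120, 1)]
estimate = rds_estimate(samples, lambda uptime: "short" if uptime < 60 else "long")
print(estimate)
```

The raw sample is 80% "short", but after re-weighting the two groups come out equal, which is what a degree-proportional visiting bias would imply for this toy data.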

9
Evaluation Overview
  • Performance metric
  • Consider only peer properties that may interact
    with the walk
  • 1) Peer Degree, 2) Peer Uptime, 3) Peer RTT
  • Compare the distributions of these peer properties
    from samples and ground truth using the
    Kolmogorov-Smirnov (KS) statistic
  • Evaluation Methodology
  • Evaluation over static graphs
  • Effect of graph structure
  • Evaluation over dynamic graphs (session level
    simulation)
  • Benefits of parallel Sampling (see the paper)
  • Effect of 1) churn, 2) peer discovery, 3)
    target peer degree
  • Experiments over Gnutella network
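
The KS statistic used as the performance metric is the largest vertical gap between two empirical CDFs. A minimal two-sample version in pure Python might look like the sketch below; the function name `ks_statistic` and the sample values are assumptions for illustration, not the authors' tooling.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # fraction of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Hypothetical degree samples: ground truth vs. what a sampler collected.
truth = [1, 2, 3, 4]
sampled = [3, 4, 5, 6]
print(ks_statistic(truth, truth))    # identical samples
print(ks_statistic(truth, sampled))  # shifted samples
```

A value of 0 means the sampled distribution matches the reference exactly, and 1 means the two do not overlap at all, so smaller is better.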

10
Evaluation: Static Graphs
  • Using graphs with different degree distributions
    and clustering characteristics
  • Random graphs (ER): Erdős-Rényi
  • Small-world graphs (SW): Watts and Strogatz
  • Scale-free graphs (BA): Barabási and Albert
  • Hierarchical Scale-Free graphs (HSF) [Barabasi02]
  • Power-law degree distribution
  • Node clustering is inversely proportional to node
    degree
  • Gnutella graphs (GA): snapshots of the Gnutella
    Ultrapeer topology

11
Hierarchical Scale-Free (HSF)
12
Static Graphs
  • Accuracy of both techniques improves with the
    number of samples in most cases
  • The rate of improvement in accuracy is much lower
    over HSF, especially for MRW
  • Walkers are likely to get trapped within clusters
    in HSF graphs
  • Leaving a cluster requires visiting high degree
    nodes but MRW is less likely to visit these nodes
  • Rewiring a small fraction of randomly selected
    edges in HSF significantly improves accuracy for
    both techniques
  • RDS is less sensitive to graph clustering than MRW
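
One way to picture the rewiring step is the sketch below, which replaces a random fraction of edges with edges between randomly chosen node pairs. The slides do not say how the rewiring was actually performed, so this procedure is an assumption, as are the function name `rewire`, the ring graph, and the 10% fraction.

```python
import random

def rewire(edges, n, fraction, rng):
    """Replace a random `fraction` of undirected edges with random new edges.

    edges: iterable of (u, v) pairs over nodes 0..n-1.
    Returns a new set of edges of the same size, without self-loops or duplicates.
    """
    edges = {tuple(sorted(e)) for e in edges}
    k = int(fraction * len(edges))
    for old in rng.sample(sorted(edges), k):
        edges.remove(old)
        while True:
            u, v = rng.sample(range(n), 2)   # two distinct endpoints
            new = (min(u, v), max(u, v))
            if new not in edges:
                edges.add(new)
                break
    return edges

# Hypothetical example: a 20-node ring, with 10% of its edges rewired.
n = 20
ring = [(i, (i + 1) % n) for i in range(n)]
rewired = rewire(ring, n, 0.10, random.Random(3))
print(len(rewired), sorted(rewired - {tuple(sorted(e)) for e in ring}))
```

Even a couple of such long-range edges act as shortcuts between otherwise distant regions of the graph.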

13
Dynamic Graphs
  • Churn is a primary limiting factor for accuracy
  • Session length > 5 min → very good sampling accuracy
  • Churn model has little effect
  • Similar impact on other peer properties (see the
    paper)
  • Sampling error is small once nodes have
    sufficient connectivity (degree > 5)
  • Lower accuracy for smaller degree is due to graph
    partitioning
  • Partitioned nodes under the History mechanism reduce the
    accuracy of sampling

14
Experiment: Gnutella
  • Run crawler, 1000 RDS & 1000 MRW walkers in
    parallel
  • 500 steps per walker
  • Use the snapshots captured by the crawler as a rough
    reference
  • Show min, max, avg KS over 6 experiments
  • Focus only on the degree distribution
  • The degree distributions from samples and crawls are very
    similar (KS 0.03)
  • The accuracy is an order of magnitude lower than
    in the dynamic simulations, due to the inaccurate reference.
  • Both sampling techniques achieve similar accuracy

15
Conclusions & Future Work
  • RDS always performs as well as or better than MRW
  • A high level of graph clustering can significantly
    degrade the accuracy of both RDS and MRW
  • RDS is less sensitive than MRW to graph
    clustering
  • There is a sweet spot for the number of parallel
    samplers.
  • Poor connectivity and high dynamics adversely
    affect the accuracy of both techniques.
  • Future Work
  • RDS is a promising approach for sampling user
    properties in Online Social Networks
  • Sampling over directed graphs raises new
    challenges.

16
  • Thank You !

17
Different Graph Structures
18
Dynamic Simulation Setting
  • Simulation environment
  • Session-time distributions: Weibull,
    Exponential, Pareto
  • Poisson arrival process
  • Peer discovery: Oracle, FIFO, HeartBeat, History
  • Target population: 100,000
  • Min. degree: 3-30
  • Sampling Parameters
  • Node degree (DEG)
  • Node query latency (RTT)
  • Session length/uptime (UT)
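
A session-level churn model of this kind can be sketched with Python's standard `random` module: peers arrive as a Poisson process and each draws a session length from one of the three listed distributions. Every numeric parameter below (means, shapes, arrival rate, horizon) is an illustrative assumption, since the slides do not give the values used in the simulations, and the function names are hypothetical.

```python
import random

def session_length(kind, rng):
    """Draw one session length in minutes from the named distribution.
    All parameter values are illustrative assumptions."""
    if kind == "exponential":
        return rng.expovariate(1 / 30.0)       # mean of 30 minutes
    if kind == "weibull":
        return rng.weibullvariate(30.0, 0.6)   # scale 30, shape 0.6
    if kind == "pareto":
        return 5.0 * rng.paretovariate(1.5)    # minimum 5, shape 1.5
    raise ValueError(kind)

def poisson_arrivals(rate, horizon, rng):
    """Arrival times of a Poisson process with `rate` arrivals per minute."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)             # exponential inter-arrival gaps
        if t > horizon:
            return times
        times.append(t)

rng = random.Random(42)
arrivals = poisson_arrivals(rate=2.0, horizon=60.0, rng=rng)
sessions = [(t, t + session_length("weibull", rng)) for t in arrivals]
print(len(sessions), sessions[:3])
```

Each `(join, leave)` pair describes one peer session. A simulator would add the peer to the overlay at `join` and remove it at `leave`.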

19
Evaluation: Static Graphs Conclusion
  • Combination of highly skewed degree distribution
    and highly skewed clustering traps samplers
  • RDS samplers get out of the clusters quickly
  • MRW samplers get stuck in low-degree clusters
  • Shuffling provides short-cuts out of clusters
  • HSF can be a model for some natural and social
    networks

20
Dynamic Graphs: Effect of Parallelism
  • Too much parallelism does not improve performance
  • Overly long random walks have a negative effect
  • Sweet spot exists