Unveiling Hidden Topologies: Applications, Algorithms and Measurements

Slides: 46
Provided by: lakhinabye
Learn more at: https://www.cs.bu.edu
Transcript and Presenter's Notes

1
Unveiling Hidden Topologies: Applications,
Algorithms and Measurements
John Byers Department of Computer Science,
Boston University
2
Hidden Topologies: What
  • Given an underlying graph G whose
  • vertices are known wholly or partially in advance
  • edges are unknown in advance
  • Identify properties of the edge set via a
    sequence of probes (samples).
  • Application-specific probing
  • Statistical vs. topological properties
  • Exact vs. approximate guarantees
  • Adaptive vs. non-adaptive sampling

3
Hidden Topologies: Where
  • Traditional science: study of found objects.
  • Protein-protein interaction networks
  • Metabolic networks
  • Genome mapping
  • Emerging domain: study of engineered artifacts,
    with the scientific posture accorded to found
    objects.
  • Internet topology map is not known
  • Size, proprietary information, distributed
  • Router-level topology vs. AS-level topology
  • Dynamic topologies: up-to-date maps are
    infeasible to maintain.
  • Examples: P2P, overlays, large testbeds

4
Hidden Topologies: Why
  • Compelling applications
  • Existing approaches are point solutions
  • Cross-cutting theory is not yet well developed
  • Pitfalls/weaknesses not widely disseminated
  • Impact of models
  • better models may make for better algorithms
  • principles inform probing process

5
Hidden Topologies: Foundations
  • Traceroute exploration of many graphs yields
    heavy-tailed subnets [LBCX 03, ACKM 05].
  • Random subnets of scale-free graphs are not
    scale-free [Stumpf, Wiuf, May, PNAS 3/22/05].
  • Parsimonious subgraph generation model
  • Vertices selected uniformly at random.
  • Edge (i, j) included iff both i and j
    selected.
  • Two examples of strong sampling bias.

6
Outline
  • Motivating hidden graphs
  • Case study 1: Hidden graphs in genome sequencing
    applications.
  • Case study 2: Internet mapping studies.
  • Case study 3: Locating constrained, annotated
    Internet subgraphs.
  • Discussion

7
Example: Interaction networks
  • Protein-protein interaction (PPI) networks,
    genomics.
  • Nodes correspond to (known) chemicals.
  • Edges correspond to observable chemical
    reactions.
  • Probe: Combine an arbitrary subset S of
    chemicals.
  • Binary probe Q_G(S):
  • 0: non-existence of any edge within S
  • 1: existence of at least one edge in S
  • Example: Genome sequencing of contigs
  • Model: at most one incident edge per node
  • Goal: identify hidden matching efficiently.
  • Shotgun sequencing: parallelize probe process
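
The binary probe described above can be sketched in a few lines. This is a minimal illustration, assuming the hidden graph is a perfect matching on vertices 0..n-1; the helper names `make_hidden_matching` and `query` are hypothetical and not taken from the talk.

```python
import random

def make_hidden_matching(n, seed=0):
    """Return a random perfect matching on vertices 0..n-1 (n even)."""
    rng = random.Random(seed)
    verts = list(range(n))
    rng.shuffle(verts)
    # Consecutive pairs of the shuffled order form the hidden edges.
    return {frozenset(verts[i:i + 2]) for i in range(0, n, 2)}

def query(matching, subset):
    """Binary probe Q_G(S): 1 if some hidden edge lies entirely within S, else 0."""
    s = set(subset)
    return int(any(edge <= s for edge in matching))

matching = make_hidden_matching(8, seed=1)
print(query(matching, range(8)))  # the full vertex set contains every edge
print(query(matching, [0]))       # a single vertex cannot contain an edge
```

The algorithm designer only ever sees the 0/1 answers, never `matching` itself; the goal is to reconstruct it with as few calls to `query` as possible.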

8
Querying hidden graphs
  • [Grebinski, Kucherov 97, 98]: Asymptotically
    optimal query bounds for hidden Hamiltonian
    cycles.
  • [Beigel, Alon, Apaydin, Fortnow, Kasif 01]:
    Asymptotically optimal query bounds for hidden
    matchings.
  • [Alon, Beigel, Kasif, Rudich, Sudakov 02]:
    Nearly-tight upper and lower bounds on hidden
    matchings for both deterministic and randomized
    algorithms.
  • [Angluin, Chen 04]: Learning hidden graphs in
    O(m log n) queries.
  • [Alon, Asodi 04]: Learning hidden subgraphs.
  • [Angluin, Chen 04]: Learning hidden
    hypergraphs.
  • (Numerous experimental biology papers).

9
Matching Query
(Slides courtesy of Simon Kasif)
10
Matching Query
Yes
11
Matching Query
No
12
Upper and Lower Bounds
13
1- or 2-Round Probabilistic Algorithm
  • 1. Form $O(n \log n)$ tubes of size $O(\sqrt{n})$ independently
    at random. Test each tube to see if it contains
    an edge.
  • 1a. For each pair {u,v}, see if {u,v} is
    contained in a tube that tested negative in step
    1. If so, {u,v} is a nonedge.
  • 1b. For each pair {p,q}, see if {p,q} is
    contained in a tube that tested positive in step
    1 but in which every other pair was determined to
    be a nonedge in step 1a. If so, {p,q} is an
    edge.
  • 2. Test all pairs whose status is still unknown.
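
The steps above can be sketched as a small simulation. This is an illustrative sketch only: the constants (exactly n log2 n tubes of size floor(sqrt(n))) and the function name `two_round` are assumptions chosen for readability, not the tuned values from the cited papers.

```python
import itertools
import math
import random

def two_round(n, hidden_edges, seed=0):
    """Recover a hidden matching with one batch of random tubes plus a cleanup round."""
    rng = random.Random(seed)
    edges = {frozenset(e) for e in hidden_edges}
    tests = 0

    def probe(s):
        nonlocal tests
        tests += 1
        return any(e <= s for e in edges)

    # Round 1: random tubes, all testable in parallel.
    size = max(2, math.isqrt(n))
    tubes = [frozenset(rng.sample(range(n), size))
             for _ in range(int(n * math.log2(n)))]
    results = [probe(t) for t in tubes]

    # Step 1a: every pair inside a negative tube is a nonedge.
    nonedges = set()
    for tube, positive in zip(tubes, results):
        if not positive:
            nonedges.update(frozenset(p) for p in itertools.combinations(tube, 2))

    # Step 1b: a positive tube with exactly one unresolved pair reveals an edge.
    found = set()
    for tube, positive in zip(tubes, results):
        if positive:
            unknown = [frozenset(p) for p in itertools.combinations(tube, 2)
                       if frozenset(p) not in nonedges]
            if len(unknown) == 1:
                found.add(unknown[0])

    # Round 2: test each pair whose status is still unknown.
    for p in itertools.combinations(range(n), 2):
        pair = frozenset(p)
        if pair not in nonedges and pair not in found and probe(pair):
            found.add(pair)
    return found, tests

n = 40
rng = random.Random(1)
verts = list(range(n))
rng.shuffle(verts)
hidden = {frozenset(verts[i:i + 2]) for i in range(0, n, 2)}
found, tests = two_round(n, hidden, seed=2)
print(found == hidden, tests)
```

Whatever the random choices, the recovered set equals the hidden matching, because steps 1a and 1b only draw conclusions that are logically forced and round 2 tests everything else directly. Randomness affects only the number of tests.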

14
Probabilistic Algorithms (optimizing constants)
  • Procedure RPP (random projective plane)
  • Assume $n = p^2 + p + 1$.
  • Randomly permute all n vertices and identify them
    with the points of the projective plane P of
    order p.
  • Perform one test for each line in P.
  • Fix {x,y}. {x,y} belongs to a unique line in P.
    The probability that that line contains no
    matched edge (except perhaps {x,y} itself) is
    $\approx e^{-1/2}$.
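
To make procedure RPP concrete, here is a minimal sketch, assuming p is prime so that the plane can be built from arithmetic mod p. The function names `projective_plane` and `rpp` and the `query` callback are hypothetical.

```python
import itertools
import random

def projective_plane(p):
    """Return (number of points, lines) of the projective plane of prime order p."""
    # One representative per projective point: first nonzero coordinate equals 1.
    points = [v for v in itertools.product(range(p), repeat=3)
              if any(v) and next(x for x in v if x) == 1]
    # By duality the same triples index the lines: a*x + b*y + c*z = 0 (mod p).
    lines = [[i for i, pt in enumerate(points)
              if sum(a * b for a, b in zip(coef, pt)) % p == 0]
             for coef in points]
    return len(points), lines

def rpp(p, query, rng):
    """One run of RPP: place vertices on points at random, then test every line."""
    n, lines = projective_plane(p)
    perm = list(range(n))
    rng.shuffle(perm)  # vertex perm[i] sits on point i
    tubes = [frozenset(perm[i] for i in line) for line in lines]
    return [(tube, query(tube)) for tube in tubes]

n, lines = projective_plane(3)
print(n, len(lines), len(lines[0]))  # 13 points, 13 lines, 4 points per line
```

Because any two points of the plane lie on exactly one common line, each run places every pair of vertices in exactly one tube, which is what makes the per-pair probability calculation on the slide clean.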

15
2 rounds, $0.74\, n \log n$ tests, 0-sided error
  • Perform procedure RPP $d \log n$ times independently
    in parallel.
  • The probability that every line containing {x,y}
    contains an edge (besides {x,y}) is at most
    $\epsilon(d) \approx (1 - e^{-1/2})^{d \log n}$.
  • Choosing $d \approx 0.74$, $\epsilon(d) \approx 1/n$. The
    remaining edges (at most n/2 on average) are
    tested in round 2.
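
A short worked derivation shows where the constant 0.74 comes from. It writes the failure bound on this slide as ε(d) and assumes the logarithm is base 2, which is the reading under which the stated constant comes out.

```latex
\begin{align*}
\epsilon(d) &\approx \left(1 - e^{-1/2}\right)^{d \log_2 n}
             = n^{\,d \log_2\left(1 - e^{-1/2}\right)}
             \approx n^{-1.346\, d}, \\
\epsilon(d) &\le \frac{1}{n}
  \iff d \ge \frac{1}{1.346} \approx 0.74, \\
\text{round-1 tests} &= \underbrace{(p^2 + p + 1)}_{\text{lines per plane}\,=\,n}
  \cdot\, d \log_2 n \approx 0.74\, n \log_2 n, \\
\mathbb{E}[\text{unresolved pairs}] &\le \binom{n}{2} \cdot \frac{1}{n} \approx \frac{n}{2}.
\end{align*}
```

The last line is the source of the "at most n/2 on average" pairs left for round 2.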

16
Modeling and algorithms: success
  • Highly structured hidden graph
  • Clean abstraction
  • Flexible probing process
  • Amenable to randomization, parallelization
  • Practically useful

17
Outline
  • Motivating hidden graphs
  • Case study 1: Hidden matchings in genome
    sequencing.
  • Case study 2: Internet mapping studies.
  • Case study 3: Locating constrained, annotated
    Internet subgraphs.
  • Discussion

18
Internet mapping efforts
  • Goal: Discover the Internet router-level
    topology
  • Vertices represent routers.
  • Edges connect routers that are one IP hop apart.

19
Most experimental traceroute studies [Pansiot et
al 98, Govindan et al 00, Broido et al 01-05,
etc.]
  • k sources: Few active sources, strategically
    located.
  • m destinations: Many passive destinations,
    globally dispersed.
  • Union of many traceroute paths.
  • (k,m)-traceroute study

[Figure: traceroute paths from sources to destinations]
20
A thought experiment
  • Idea: Simulate topology measurements on a random
    graph.
  • Generate a sparse Erdős-Rényi random graph,
    G(V,E). Each edge is present independently with
    probability p. Assign weights $w(e) = 1 + \epsilon$,
    where $\epsilon$ in …
  • Pick k unique source nodes, uniformly at random.
  • Pick m unique destination nodes, uniformly at
    random.
  • Simulate traceroute from k sources to m
    destinations, i.e. learn shortest paths between k
    sources and m destinations.
  • Let G' be the union of shortest paths.
  • Ask: How does G' compare with G?
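
The thought experiment is easy to reproduce at small scale. The sketch below is an approximation under stated assumptions: it uses a much smaller graph than the talk, and it breaks shortest-path ties by visiting neighbors in random order during breadth-first search rather than by perturbing edge weights. All function names are hypothetical.

```python
import random
from collections import deque

def er_graph(n, p, rng):
    """Adjacency lists of an Erdos-Renyi graph: each edge present with probability p."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def bfs_parents(adj, src, rng):
    """Shortest-path tree from src; random neighbor order stands in for tie-breaking."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        nbrs = adj[u][:]
        rng.shuffle(nbrs)
        for v in nbrs:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def measure(adj, k, m, rng):
    """Edge set of the union of shortest paths from k sources to m destinations."""
    n = len(adj)
    sources = rng.sample(range(n), k)
    dests = rng.sample(range(n), m)
    edges = set()
    for s in sources:
        parent = bfs_parents(adj, s, rng)
        for d in dests:
            v = d
            # Walk back toward the source; unreachable destinations are skipped.
            while v in parent and parent[v] is not None:
                edges.add(frozenset((v, parent[v])))
                v = parent[v]
    return edges

def degrees(edges):
    deg = {}
    for e in edges:
        for v in e:
            deg[v] = deg.get(v, 0) + 1
    return deg

rng = random.Random(0)
adj = er_graph(1500, 0.01, rng)          # mean degree about 15, as in the talk
measured = measure(adj, k=3, m=300, rng=rng)
deg_measured = degrees(measured)
mean_all = sum(len(a) for a in adj) / len(adj)
mean_seen = sum(len(adj[v]) for v in deg_measured) / len(deg_measured)
print("mean true degree, all nodes:     ", round(mean_all, 2))
print("mean true degree, measured nodes:", round(mean_seen, 2))
```

Comparing a histogram of `deg_measured.values()` with the true degrees `len(adj[v])` is the small-scale analogue of comparing the measured graph with the underlying one.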

21
[Figure: log-log plot of log(Pr[X > x]) vs. log(Degree) for the underlying random graph G and the measured graph G']
Underlying graph: N = 100000, p = 0.00015. Measured
graph: k = 3, m = 1000.
G' is a biased sample of G that looks
heavy-tailed. Are heavy tails a measurement
artifact?
22
Are nodes sampled unevenly?
  • Conjecture: Shortest path routing favors higher
    degree nodes ⇒ nodes sampled unevenly.
  • Validation: Examine true degrees of nodes in
    measured graph, G'. Expect true degrees of nodes
    in G' to be higher than degrees of nodes in G, on
    average.

23
Are edges sampled unevenly?
  • Conjecture: Edges selected incident to a node in
    G' not proportional to true degree.
  • Validation: For each node in G', plot true degree
    vs. measured degree. If unbiased, ratio of true
    to measured degree should be constant. Points
    clustered around the y = cx line (c < 1).

24
What does this suggest?
Summary: Edges are sampled unevenly by
(k,m)-traceroute methods. Edges close to the
source are sampled more often than edges further
away.
Intuitive picture: Neighborhood near sources is
well explored, but visibility of edges declines
sharply with hop distance from sources.
25
Non-Adaptive Scaling Laws
  • Choose k sources and m destinations at random.
  • Consider the subgraph G' = (V', E') induced by
    routes from R between all (source, dest) pairs.
  • How do expected values of |V'| and |E'| scale as
    a function of k and m for various graph models?
  • One special case for k = 1 well understood.
  • Chuang-Sirbu multicast scaling law: $|E'| \propto m^{0.8}$
  • Analysis in [Phillips et al 99, van Mieghem et
    al 02]
  • Formulations for general k are open.
  • Also of interest: quantification of marginal
    utility of adding the (k+1)st source or destination
    [BBBC 01].
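
As a quick worked reading of the k = 1 scaling law quoted above, assume the proportionality holds exactly over the range of interest. Doubling the number of destinations then gives:

```latex
\[
\frac{|E'|_{2m}}{|E'|_{m}} \approx \frac{(2m)^{0.8}}{m^{0.8}} = 2^{0.8} \approx 1.74
\]
```

So twice as many destinations uncovers roughly 1.74 times as many edges rather than twice as many, since paths from a single source share links near the source.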

26
Statistical Test 1
C1: Are the highest-degree nodes near the
source? If so, then consistent with bias.
H0 (C1): The 1% highest degree nodes occur at random with
distance to nearest source.
  • Cut vertex set in half: N (near) and F (far), by
    distance from nearest source.
  • Let v = (0.01)|V|
  • κ = fraction of v that lies in N
  • Can bound likelihood κ deviates from 1/2 using
    Chernoff bounds

Reject hypothesis with confidence 1-α if [condition not preserved in transcript]
27
Statistical Test 2
C2: Is the degree distribution of nodes near the
source different from those further away? If
so, consistent with bias.
H0 (C2): Chi-Square Test succeeds on degree distribution
for nodes near the source and far from the
source.
Partition vertices across median distance: N
(near) and F (far). Compare degree distribution
of nodes in N and F, using the Chi-Square Test:
$\chi^2 = \sum_{i=1}^{l} (O_i - E_i)^2 / E_i$
where O and E are observed and expected degree
frequencies and l is histogram bin size. Reject
hypothesis with confidence 1-α if [condition not preserved in transcript]
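
A minimal sketch of the chi-square comparison follows. It assumes the expected frequencies come from the pooled (near plus far) degree histogram rescaled to the size of the near sample, which is one common convention and not necessarily the one used in the talk; the function name and bin edges are hypothetical.

```python
def chi_square_near_vs_pooled(near_degrees, far_degrees, bin_edges):
    """Pearson chi-square statistic of the near-degree histogram against the pooled one."""
    def hist(values):
        counts = [0] * (len(bin_edges) - 1)
        for x in values:
            for i in range(len(counts)):
                if bin_edges[i] <= x < bin_edges[i + 1]:
                    counts[i] += 1
                    break
        return counts

    observed = hist(near_degrees)
    pooled = hist(list(near_degrees) + list(far_degrees))
    total = sum(pooled)
    n_near = sum(observed)
    if total == 0:
        return 0.0
    stat = 0.0
    for o, c in zip(observed, pooled):
        expected = n_near * c / total
        if expected > 0:               # empty pooled bins contribute nothing
            stat += (o - expected) ** 2 / expected
    return stat

same = chi_square_near_vs_pooled([1, 2, 3, 4], [1, 2, 3, 4], [1, 3, 5])
diff = chi_square_near_vs_pooled([4, 4, 4, 4], [1, 1, 1, 1], [1, 3, 5])
print(same, diff)  # 0.0 4.0
```

The statistic is then compared against a chi-square critical value for the chosen confidence level, for example from a table or from `scipy.stats.chi2.ppf` if SciPy is available.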
28
Testing C1
H0 (C1): The 1% highest degree nodes occur at random with
distance to source.
Pansiot-Grad: 93% of the highest degree nodes are
in N. Mercator: 90% of the highest degree nodes
are in N. Skitter: 84% of the highest degree
nodes are in N.
29
Testing C2
H0 (C2)
30
Several possible explanations
  • Degree distribution is distance-independent, but
    sampling is biased.
  • Degree distribution is distance-dependent, and
    nodes further from the source really do have
    below-average degree.
  • Others?
  • In practice, it appears to be a combination of
    factors.

31
Other traceroute questions
  • Suppose you had the ability to conduct adaptive
    measurements (recently feasible, e.g.
    scriptroute).
  • How to maximize edge coverage on a fixed
    measurement budget?
  • Traceroute@home (SIGMETRICS 05)
  • DIMES (INFOCOM 05)
  • AS-level traceroute (SIGCOMM 03)
  • Leverage to probe a hidden multigraph?

32
Modeling and algorithms: mixed bag
  • Unknown hidden graph
  • Misconceptions about which caused us to bark
    up the wrong tree
  • Unclean abstraction
  • Awkward, inflexible probing process
  • probes interdependent on underlying graph
  • Amenable to parallelization
  • Power of adaptation not yet known

33
Outline
  • Motivating hidden graphs
  • Case study 1: Hidden matchings in genome
    sequencing.
  • Case study 2: Internet mapping studies.
  • Case study 3: Locating constrained, annotated
    Internet subgraphs.
  • Discussion

34
Experimental Methodologies
  • Simulation
  • Blank slate for crafting experiments
  • Fine-grained control, specifying all details
  • No external surprises, not especially realistic
  • Emulation
  • All the benefits of simulation, plus
  • running real protocols on real systems
  • Internet experimentation
  • Even more realistic
  • Much harder to set up, control experiments

35
Controlled Internet Experimentation
  • Our question
  • Can we bridge over some of the attractive
    features of simulation and emulation into
    wide-area testbed or overlay experimentation?
  • Towards an answer
  • Which services would be useful?
  • Outline design of a set of interesting services.
  • Today's talk
  • specify parameters of an experiment on a blank
    slate
  • locate one or more sub-topologies matching
    specification

36
Annotated topologies: problem statement
  • User specifies an envisioned target topology T
  • edges and bounds on their attributes
  • or, more interesting: only path attributes (RTTs)
  • Then, given an overlay network G whose
  • vertices are known in advance and
  • whose paths have measurable, multi-dimensional
    attributes not known in advance
  • Conduct a set of adaptive probes to
  • locate a hidden instance (feasible embedding) of
    T into G respecting constraints.
  • more generally: sample from feasible embeddings

37
Specifying Topologies
  • N nodes in testbed, k nodes in specification
  • k x k constraint matrix $C = [c_{i,j}]$
  • Entry $c_{i,j}$ constrains the end-to-end path between
    embedding of virtual nodes i and j.
  • For example, place bounds on RTTs:
  • $c_{i,j} = [l_{i,j}, h_{i,j}]$ represents lower and upper
    bounds on target RTT.
  • Constraints can be multi-dimensional.
  • Constraints can also be placed on nodes.
  • More complex specifications possible...

38
Feasible Embeddings
  • Defn: A feasible embedding is a mapping f such
    that for all i, j where f(i) = x and f(j) = y:
  • $l_{i,j} \le d(x, y) \le h_{i,j}$
  • Do not need to know d(x, y) exactly, only that
  • $l_{i,j} \le l(x, y) \le d(x, y) \le h(x, y) \le h_{i,j}$
  • Key point: Testbed need not be exhaustively
    characterized, only sufficiently well to embed.
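
The definition translates directly into a check that works from measured bounds alone. In this sketch the data layout (dicts keyed by index pairs and by unordered node pairs), the requirement that the mapping be one-to-one, and the name `is_feasible` are assumptions made for illustration.

```python
def is_feasible(f, constraints, bounds):
    """Check a candidate embedding against interval constraints.

    f[i]                      testbed node hosting virtual node i
    constraints[(i, j)]       (l_ij, h_ij) for virtual pairs i < j
    bounds[frozenset((x, y))] (l_xy, h_xy), measured bounds on d(x, y)
    """
    if len(set(f)) != len(f):  # assumed: distinct virtual nodes get distinct hosts
        return False
    for (i, j), (l_ij, h_ij) in constraints.items():
        l_xy, h_xy = bounds[frozenset((f[i], f[j]))]
        # Sufficient condition: l_ij <= l(x,y) <= d(x,y) <= h(x,y) <= h_ij.
        if not (l_ij <= l_xy and h_xy <= h_ij):
            return False
    return True

constraints = {(0, 1): (10, 50)}
bounds = {
    frozenset(("a", "b")): (20, 40),   # coarse measurement, still inside [10, 50]
    frozenset(("a", "c")): (5, 60),    # too loose to certify the constraint
}
print(is_feasible(["a", "b"], constraints, bounds))  # True
print(is_feasible(["a", "c"], constraints, bounds))  # False
```

The second call returns False even though the true RTT between a and c might satisfy the constraint; tightening that one measurement is exactly the kind of adaptive probe the problem statement calls for.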

39
Hardness
  • Finding an embedding is as hard as subgraph
    isomorphism (NP-complete)
  • Counting or sampling from set of feasible
    embeddings is #P-hard.
  • Approximation algorithms are not much better.

40
Current Best-Practice
  • Brute force search [CBM 03, HotNets-II].
  • No joke.
  • Situation is not quite as dire as it sounds.
  • Several methods for pruning the search tree.
  • Adaptive measurement heuristics.
  • Many (almost all?) user problem instances not
    near boundary of solubility and insolubility.
  • Prototype service on PlanetLab
  • Off-line searches up to thousands of nodes.
  • On-line searches up to hundreds of nodes.
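
A pruned brute-force search of the kind described can be sketched as plain backtracking. This is a generic illustration, not the implementation from [CBM 03]: it assumes exact RTTs are already known for every pair, and the names `find_embedding`, `rtt`, and the example values are hypothetical.

```python
def find_embedding(k, nodes, constraints, rtt):
    """Backtracking search for a one-to-one embedding of k virtual nodes.

    constraints[(i, j)]      (low, high) RTT bounds for virtual pairs i < j
    rtt[frozenset((x, y))]   measured RTT between testbed nodes x and y
    Returns a list f with f[i] = host of virtual node i, or None.
    """
    assignment = []

    def consistent(j, y):
        # Prune: check the new node only against already-placed virtual nodes.
        for i, x in enumerate(assignment):
            bound = constraints.get((i, j))
            if bound is not None:
                low, high = bound
                if not (low <= rtt[frozenset((x, y))] <= high):
                    return False
        return True

    def extend():
        j = len(assignment)
        if j == k:
            return True
        for y in nodes:
            if y not in assignment and consistent(j, y):
                assignment.append(y)
                if extend():
                    return True
                assignment.pop()   # dead end, try the next host
        return False

    return list(assignment) if extend() else None

rtt = {frozenset(pair): ms for pair, ms in [
    (("A", "B"), 10), (("A", "C"), 50), (("A", "D"), 80),
    (("B", "C"), 45), (("B", "D"), 75), (("C", "D"), 30),
]}
constraints = {(0, 1): (5, 15), (0, 2): (40, 60), (1, 2): (40, 60)}
print(find_embedding(3, ["A", "B", "C", "D"], constraints, rtt))  # ['A', 'B', 'C']
```

Checking each constraint as soon as both endpoints are placed cuts off whole subtrees early, which is why easy instances resolve quickly despite the worst-case hardness.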

41
Current Best-Practice (cont.)
  • Good news: many of the hardness results are
    based on (unrealistic?) modeling assumptions
  • Bad news: better models for annotated topologies
    are notably absent
  • Why?
  • Measurements that might assist in model-building
    are just getting underway.
  • Capturing the practical issues in a parsimonious
    model is a formidable challenge.

42
Modeling and algorithms: virgin territory
  • Hidden graph is dynamic
  • Abstraction is reasonably clean
  • Combinatorial optimization issues may pose thorny
    problems for analysis
  • Model-based approaches could help

43
Takeaway messages
  • Numerous hidden graphs in science; more emerging
    as engineered artifacts.
  • Principled measurement/modeling/validation will
    be needed.
  • Forums for discussion and dissemination of ideas
    across disciplines will help.

44
References cited (p. 1 of 2)
  • AA 04 N. Alon and V. Asodi, Learning a hidden
    subgraph, Proc. of 31st ICALP, 2004.
  • AC 04 D. Angluin and J. Chen, Learning a
    hidden graph using O(log n) queries per edge,
    COLT 2004.
  • ABK 02 N. Alon, R. Beigel, S. Kasif, S.
    Rudich and B. Sudakov, Learning a hidden
    matching: Combinatorial identification of hidden
    matchings with applications to whole genome
    sequencing, SIAM Journal on Computing, 2004.
  • ACKM 05 D. Achlioptas, A. Clauset, D. Kempe
    and C. Moore, On the bias of traceroute
    sampling, Proc. of ACM STOC 2005.
  • BAA 01 R. Beigel, N. Alon, S. Apaydin, L.
    Fortnow and S. Kasif, An optimal procedure for
    gap closing in whole genome shotgun sequencing,
    Proc. of ACM RECOMB 2001.
  • BBBC 01 P. Barford, A. Bestavros, J. Byers and
    M. Crovella, On the marginal utility of network
    topology measurements, Proc. of 1st ACM SIGCOMM
    Internet Measurement Workshop, 2001.
  • BC 01 (Skitter) A. Broido and K. Claffy,
    Connectivity of IP graphs, Proc. of SPIE ITCom,
    August 2001.
  • CBM 03 J. Considine, J. Byers and K.
    Mayer-Patel, A constraint satisfaction approach
    to testbed embedding services, Proc. of ACM
    HotNets Workshop, 2003.
  • DIMES The DIMES project. www.netdimes.org.
  • DRFC 05 B. Donnet, P. Raoult, T. Friedman, and
    M. Crovella, Efficient algorithms for
    large-scale topology discovery, to appear in
    Proc. of ACM SIGMETRICS, 2005.
  • FFF 99 M. Faloutsos, P. Faloutsos and C.
    Faloutsos, On power-law relationships of the
    Internet topology, Proc. of ACM SIGCOMM 99.

45
References cited (p. 2 of 2)
  • GK 98 V. Grebinski and G. Kucherov,
    Reconstructing a Hamiltonian cycle by querying
    the graph: Application to DNA physical mapping,
    Discrete Applied Math. 88 (1998).
  • GT 00 (Mercator) R. Govindan and H.
    Tangmunarunkit, Heuristics for Internet map
    discovery, Proc. of IEEE INFOCOM 2000.
  • LBCX 03 A. Lakhina, J. Byers, M. Crovella and
    P. Xie, Sampling biases in IP topology
    measurements, Proc. of IEEE INFOCOM 2003.
  • MHH 01 P. van Mieghem, G. Hooghiemstra, R. van
    der Hofstad, On the efficiency of multicast,
    IEEE/ACM Transactions on Networking, May 2001.
  • PG 98 J. Pansiot and D. Grad, On routes and
    multicast trees in the Internet, ACM Computer
    Communications Review, 28(1), 1998.
  • PST 99 G. Phillips, S. Shenker and H.
    Tangmunarunkit, Scaling of multicast trees:
    comments on the Chuang-Sirbu scaling law, Proc.
    of ACM SIGCOMM 99.
  • SMW 02 N. Spring, R. Mahajan and D. Wetherall,
    Measuring ISP topologies with Rocketfuel,
    Proc. of ACM SIGCOMM 2002.
  • SWM 05 M. Stumpf, C. Wiuf and R. May, Subnets
    of scale-free networks are not scale-free:
    Sampling properties of networks, PNAS 102(12),
    March 22, 2005.